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For  each  state  in  the  LR(0)  automaton  of  a  context-free  grammar,  there  exists  a 
language  consisting  of  the  valid  strings  that  might  be  parsed  from  that  state.  Exist- 
ing techniques  for  adding  lookahead  to  an  LR(0)  automaton  do  not  explicitly  consider 
these  "continuation"  languages.  Instead,  existing  techniques  examine  prefixes  of  con- 
tinuation languages  or  some  other  simplification  or  approximation  of  them.  We 
define  continuation  languages  and  describe  an  algorithm  for  computing  them.  We 
then  show  that  continuation  languages  can  provide  lookahead  for  inconsistent  states 
of  an  LR(0)  automaton,  providing  single-symbol  lookahead,  ^-symbol  lookahead,  and 
regular  lookahead,  each  only  where  needed.  We  submit  that  continuation  languages 
provide  a  sound  theoretical  basis  for  the  study  of  LR  parsing. 
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CHAPTER  1 
INTRODUCTION 


The  use  of  LR  parser  generators  is  the  technique  of  choice  when  creating 
parsers  automatically  from  context-free  grammars.  Most  practical  generators  use 
some  simplification  of  LR(l),  such  as  SLR(l),  NQLALR(l)  or  LALR(l).  While  these 
single-symbol  lookahead  techniques  are  quite  powerful,  they  are  inadequate  for 
describing  many  interesting  languages  in  a  natural  way.  Instead,  the  practitioner  is 
forced  either  to  modify  the  language  to  make  it  acceptable  to  the  parser  generator  or 
to  contort  the  grammar  to  make  it  acceptable.  In  those  cases  in  which  the  practi- 
tioner has  jurisdiction  over  the  language,  modifying  it  is  probably  the  wise  choice. 
However,  if  the  language  definition  cannot  be  modified  (which  is  more  frequently  the 
case),  the  user  needs  either  considerable  skill  in  manipulating  grammars  or  a  better 
parser  generation  tool. 

Context-free  grammars  for  "standard"  programming  languages  are  generally 
designed  to  facilitate  parsing  by  an  automatically  generated  parser.  Nonetheless, 
even  in  some  "standard"  languages,  certain  features  are  difficult  to  parse  with 
single-symbol  lookahead.  Consider  the  simplified  productions  for  the  COMMON 
statement  in  Fortran-77  displayed  in  figure  1-1. 

The  basic  problem  with  this  grammar  fragment  is  that  after  an  ElemenLlist ,  a 
comma  might  either  herald  yet  another  Element,  or  indicate  that  the  ElemenLlist  is 
complete  and  herald  another  Named— common .  A  second  symbol  of  lookahead  is 
required  to  determine  whether  (1)  the  comma  is  part  of  a  longer  Element-list,  or  (2) 
the  Element-list  is  complete,  which  implies  completion  of  a  Named— common . 


Common^tmt 

Named-common-list 

Namtd— common-list 

Named-common 

Element-list 

ElemenLJist 

OptionaL-comma 

OptionaLxomma 


''common''  Named-commonJtist 

Named-common-list  OptionaLxomma  Named-common 
Named— common 

7'  Common-name  '/'  Elementd,ist 
ElementJ,ist  ','  Element 
Element 
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Figure  1-1.  Fortran  named  common  grammar. 

This  problem  is  resolved  in  the  grammar  presented  in  figure  1-2.  This  version  is 
slightly  more  complicated  than  the  original.  The  complexity  of  the  new  grammar  is 
immaterial;  of  importance  is  the  time  required  to  convert  the  grammar  to  LALR(l).^ 
Furthermore,  the  actual  Fortran-77  syntax  is  somewhat  more  complicated;  and, 
hence,  the  LALR(l)  conversion  would  be  more  time-consuming. 


Common^tmt 
Named— common 
Named— no— comma 
Named— common-list 
Named-commonJ,ist 
Element-list 
Element-list 
Optional-comma 
OptionaLxomma 


''common''  Named— common-list  Named-no— comma 
7'  Common-name  7'  ElemenLJist  OptionaL-comma 
7'  Common— name  7'  ElemenLJist 
Named— common~Jist  Named-common 

ElemenLJist  ','  Element 
Element 


Figure  1-2.  Fortran  named  common  grammar  requiring  single-symbol  lookahead. 


Since  Fortran  is  an  old  language,  one  might  expect  that  newer  languages  would 

be  designed  with  single-symbol  lookahead  in  mind.  While  this  is  generally  true,  the 

subprogram  specification  syntax  of  the  Ada  language,  for  example,  is  difficult  to 

express  in  an  LALR(l)  grammar  [1].  It  may  be  acceptable  for  compiler  writers  to 

^  The  conversion  required  about  2  hours  for  the  author  (who  has  some  skill)  to  perform.  A 
novice  would  find  the  prospect  daunting  to  say  the  least. 
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convert  a  grammar  to  LALR(l),  but  it  would  be  preferable  to  use  an  automatically 
generated  parser  that  requires  the  minimum  amount  of  lookahead  at  each  stage  of 
the  parse.  Such  a  parser  could  be  as  efficient  as  its  LALR(l)  counterpart  in  all 
single-symbol  lookahead  instances,  with  only  a  small  degradation  in  performance  for 
the  small  proportion  of  the  language  that  requires  further  lookahead. 

A  grammar  requiring  arbitrarily  long  lookahead  is  shown  in  figure  1-3.  This 
grammar  is  derived  from  an  example  by  Culik  and  Cohen  [2]  and  is  a  typical  gram- 
mar derived  from  using  syntax  macros  in  extendible  languages.  This  language  con- 
tains comparisons  of  both  arithmetic  and  set  expressions.  Arithmetic  and  set  opera- 
tors are  identical.  Likewise,  arithmetic  and  set  variables  and  constants  are  indistin- 
guishable. The  only  distinguishing  feature  is  the  comparison  operator.  Set  comparis- 
ons are  identified  by  the  equivalence  symbol,  "=",  while  arithmetic  comparisons  are 
identified  by  the  equals  sign,  "  =  ". 

S  -►AE=AE  S  -►SE=SE 

AE-^AE-I-AT  SE-^SE-hST 

AE     AE  -  AT  SE      SE  -  ST 

AE-^AT  SE-»>ST 

AT     AT  *  AF  ST     ST  *  SF 

AT-^AF  ST-*-SF 

AF-^id  SF-^id 

AF  — ►  const  SF  —*■  const 

Figure  1-3.  Arithmetic  expression/set  expression  grammar. 

Strings  in  the  language  defined  by  the  arithmetic  expression/set  expression 
grammar  cannot  be  parsed  using  any  fixed  number  of  lookahead  symbols.  Instead,  it 
is  necessary  to  look  arbitrarily  far  ahead  until  either  "='  or  "  =  "  is  reached.  Only 
then  is  it  known  which  production  for  S  is  applicable  and,  hence,  how  to  parse  the 
entire  expression.  From  these  examples,  we  see  that  there  is  a  need  for  parsing  tech- 
niques that  use  multiple-symbol  and  even  arbitrarily  long  lookahead  strings. 
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In  this  dissertation  we  introduce  the  notion  of  a  "continuation"  language,  con- 
sisting of  the  set  of  all  possible  strings  that  might  "validly"  be  parsed  from  an  item 
of  a  given  LR(0)  state.  LR(0)  states  are  collections  of  items  of  the  form  A— >a./?, 
where  A— ►a^  is  a  grammar  production.  In  an  LR(0)  item,  the  "."  indicates  a  possi- 
ble position  to  be  used  in  parsing  a  production.  For  an  LR(0)  state  q  with  an  item 
A—*-a.^,  the  continuation  language  contains  all  strings  that  could  follow  this  a  in  an 
LR(0)  parse  of  a  sentence  of  the  language,  provided  that  state  q  has  been  reached. 

Existing  techniques  for  adding  lookahead  to  an  LR(0)  parser  do  not  explicitly 
determine  the  complete  continuation  language  for  each  item  of  an  inconsistent  state. 
Instead,  lookahead  computation  typically  proceeds  until  either  the  conflict  is 
resolved,  or  lookahead  limits  are  exceeded.  It  is  our  thesis  that  an  analysis  of  the 
continuation  languages  yields  additional,  profitable  variations  of  the  LR  parsing  tech- 
nique. One  such  variation  consists  of  simplifying  continuation  languages  to  form  reg- 
ular expressions,  used  then  to  achieve  arbitrary-length  regular  lookahead. 

In  Chapter  2  we  discuss  the  historical  development  of  lookahead  techniques  for 
LR  parsing.  In  Chapter  3  we  define  the  terms  used  in  subsequent  chapters.  Next,  in 
Chapter  4,  we  present  the  NLR(O)  automaton  from  which  we  will  define  continua- 
tion languages.  Subsequently,  in  Chapter  5,  we  develop  a  useful  definition  of  con- 
tinuation languages.  After  formulating  this  definition,  we  present  an  algorithm  for 
computing  continuation  languages  in  Chapter  6.  In  Chapter  7  we  present  several 
improvements  to  this  continuation  algorithm.  Next,  in  Chapter  8,  we  show  how  to 
use  continuation  languages  for  finite  and  arbitrary  lookahead.  Finally,  in  Chapter  9, 
we  present  conclusions  and  discuss  possible  future  research  directions. 


CHAPTER  2 
fflSTORICAL  BACKGROUND 


Knuth's  research  on  LR(^)  parsing  [3]  forms  the  foundation  upon  which  much  of 
modern  parsing  theory  has  been  built.  Knuth's  LR(Ar)  technique  provided  an 
automated  method  for  creating  a  table-driven  parser  from  a  grammar.  Unfor- 
tunately, the  tables  defining  the  parsers  were  too  large  to  be  of  practical  benefit. 

Korenjak  [4]  devised  a  method  for  constructing  LR(fc)  parsers  with  fewer  states. 
Since  the  number  of  states  of  an  LR(A;)  parser  can  grow  exponentially  with  respect  to 
the  size  of  the  grammar,  Korenjak  partitioned  the  grammar  into  several  smaller 
grammars  and  created  a  parser  that  would  utilize  parse  tables  for  the  sub-grammars 
as  needed  to  parse  strings  of  the  target  language.  The  major  drawback  of  this  tech- 
nique is  that  the  partitioning  must  be  done  by  hand.  Even  so,  Korenjak  used  the 
technique  to  create  an  LR(1)  parser  for  Algol. 

Another  LR(A:)  parser  construction  technique  was  devised  by  Pager  [5].  His 
technique  consisted  of  merging  "compatible"  states  as  they  were  formed  during  the 
LR(^)  construction  process.  By  merging  compatible  states,  the  parse  tables  were 
smaller  and  the  construction  process  required  less  time.  Pager  also  devised  a  method 
for  splitting  states  of  an  LR(0)  machine  [6]  by  analyzing  the  possible  paths^  leading 
to  each  item  of  an  inconsistent  LR(0)  state.  His  technique  consisted  of  tracing  back- 
wards on  each  of  these  paths  xmtil  reaching  a  state  that  needed  to  be  split.  It  is 
interesting  to  note  that  Pager  apparently  made  use  of  the  same  machine  described 
later  in  this  dissertation  as  the  NLR(O)  machine.  More  recently,  Spector  has  devised 
^  Pager  referred  to  such  paths  as  "lanes." 
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another  state-splitting  scheme  [7],  but  his  technique  is  limited  to  single-symbol  look- 
ahead. 

Practical  use  of  the  LR(^)  theory  did  not  occur  until  DeRemer  discovered  the 
SLR(^)  [8]  and  LALR{k)  [9]  techniques.  Both  SLR(^)  and  LALR(^)  add  determin- 
ism to  an  LR(0)  parser  by  using  Ar-symbol  lookahead  strings  to  determine  which 
move  to  make  in  an  inconsistent  LR(0)  state.  They  differ  in  the  manner  in  which 
the  lookahead  strings  are  calculated.  SLR(^)  ignores  the  context  implied  by  the 
state,  while  LALR(^)  uses  this  context  to  calculate  the  lookaheads  as  accurately  as  is 
possible,  given  the  LR{0)  machine.  Nonetheless,  they  are  both  used  (generally  with 
A;  =  1)  to  produce  efficient  parsers  for  a  wide  class  of  grammars. 

There  have  been  several  extensions  to  the  LR(A:)  theory  permitting  arbitrary 
lookahead.  The  class  of  LR-Regular  grammars  [2]  augments  an  LR(0)  parser  with  a 
right-to-left  pre-scan  of  the  input  to  categorize  the  input  stream  into  several 
equivalence  classes.  Each  equivalence  class  is  a  regular  set  that  discriminates  among 
the  various  alternatives  for  inconsistent  LR(0)  states.  During  the  right-to-left  pre- 
scan,  each  token  is  augmented  with  its  equivalence  class.  Knowledge  of  the 
equivalence  class  of  each  token  permits  an  LR(0)  parser  to  correctly  parse  the  refined 
input  string  in  a  left-to-right  scan.  The  LR-Regular  grammars  form  the  largest  class 
of  grammars  for  which  regular  lookahead  can  resolve  inconsistencies  in  an  LR(0) 
parser.  Consequently  the  LR-Regular  grammars  contain  the  LR{k)  grammars,  but 
there  is  no  simple,  automated  technique  for  determining  which  regular  sets  to  use  for 
a  particular  LR-Regular  grammar.  Indeed,  the  existence  of  such  a  collection  of  regu- 
lar sets  for  a  given  grammar  is  undecidable  [2]. 

The  XLR  parsing  technique  [10]  is  a  practical,  automated  technique  for  adding 
arbitrary,  regular  lookahead  to  an  LR(0)  parser.  The  process  works  by  adding 
"reduce-arcs"  to  an  LR(0)  parser  to  indicate  where  a  shift  move  might  lead  after  a 
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series  of  one  or  more  reduce  moves.  Each  reduce-arc  is  labeled  by  the  initial  reduc- 
tion in  the  series  and  the  terminal  symbol  leading  to  the  destination  state.  When  an 
inconsistent  LR(0)  state  is  reached  in  parsing  an  input  string,  the  XLR  parser  uses  a 
function  named  "resolve"  to  determine  which  of  the  conflicting  actions  is  correct. 
Resolve  runs  the  finite  state  machine  composed  of  1)  the  underlying  LR(0)  finite 
state  machine  along  with  2)  the  added  reduce-arcs,  on  the  remaining  input.  The  exe- 
cution of  this  "composite"  finite  state  machine  continues  imtil  the  end  of  the  input  is 
encountered  or  until  all  but  one  of  the  original  LR(0)  actions  have  been  ruled  out. 

Each  reduce-arc  in  an  XLR  parser  is  shown  to  be  "optimal"  in  the  sense  that 
the  XLR(l)  grammars  are  exactly  the  same  as  the  LALR(l)  grammars.  However,  in 
cases  requiring  additional  lookahead,  the  XLR  technique  does  not  compare  so  favor- 
ably with  LALR(^).  Indeed,  Baker  gives  an  example  of  a  grammar  that  is  SLR(2), 
and  therefore  LALR(2),  but  not  XLR(oo).  On  the  other  hand,  the  XLR  technique 
can  parse  some  grammars  requiring  arbitrarily- long  lookahead. 

The  LAR(m)  technique  [11]  is  another  practical,  automated  technique  for 
adding  arbitrary,  regular  lookahead  to  an  LR(0)  parser.  It  achieves  regular  look- 
ahead  by  attaching  a  finite  state  lookahead  automaton  (LAA)  to  each  inconsistent 
state  of  an  LR(0)  parser.  The  states  of  the  LAA  are  composed  of  items  that  are 
similar  to  LR(fc)  items,  but  the  items  contain  the  most  recently  traversed  m  states. 
Each  item  also  has  an  associated  action,  which  is  one  of  the  conflicting  actions  of  the 
inconsistent  LR(0)  state.  The  LAA  scans  forward  in  the  remaining  input,  simulating 
all  possible  actions  of  the  LR(0)  parser  for  each  input  symbol  until  a  state  in  the 
LAA  is  reached  which  has  only  one  associated  action  in  the  LR(0)  parser.  Such 
states  are  final  LAA  states,  and  reaching  one  signals  the  correct  move  to  the  LR(0) 
parser. 
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The  LAR(m )  grammars  form  a  proper  subset  of  the  LR-Regular  grammars.  In 
fact,  this  property  holds  true  regarding  any  set  of  grammars  parsable  by  adding  regu- 
lar lookahead  to  an  LR(0)  parser.  It  has  been  shown  that  if  a  grammar  is  LALR(jfc), 
then  it  is  LAR(m)  for  some  m.  Further,  if  a  grammar  is  XLR(A;),  it  is  also  LAR(m) 
for  some  m  [12]. 

All  these  techniques  are  similar  in  that  they  do  not  explicitly  calculate  the  con- 
tinuation languages.  Instead  they  simplify  the  lookahead  in  some  fashion.  The 
SLR(A;)  and  LALR(A;)  techniques  limit  the  lookahead  to  k  symbols;  while  the  LR- 
Regular,  XLR  and  LAR  techniques  use  regular  lookahead.  Since  continuation 
languages  can  be  arbitrary  context-free  languages  (shown  later  in  this  dissertation), 
simplification  of  the  lookahead  is  inevitable  for  any  technique  using  either  finite  or 
regular  lookahead. 

In  this  dissertation  we  present  a  new  technique  for  computing  lookahead  for  LR 
parsers.  Our  technique,  like  SLR(A;),  LALR(A;),  LR-Regular,  XLR  and  LAR,  consists 
of  adding  lookahead  to  an  LR(0)  parser.  The  difference  is  that  we  calculate  the  con- 
tinuation languages  for  inconsistent  LR(0)  states  and  determine  lookahead  from  these 
languages. 


CHAPTER  3 
BACKGROUND  AND  TERMINOLOGY 


Here  we  briefly  discuss  graphs,  strongly  connected  components,  context-free 
grammars,  finite-state  automata,  regular  expressions  and  LR(0)  parsers.  A  detailed 
discussion  of  graph  theory  is  presented  by  Aho,  Hopcroft  and  Ullman  [13],  while  an 
introduction  to  formal  language  theory  is  given  by  Hopcroft  and  Ullman  [14]. 

Terms  from  Graph  Theory 

A  graph  is  an  ordered  pair  G={V,E)  where  V  is  a  finite,  non-empty  set  of 
vertices  and  J^C^X  F  is  a  set  of  pairs  of  vertices  called  edges  .  In  general,  graphs 
may  be  considered  to  be  either  directed  or  undirected.  A  directed  graph  is  a 
graph  in  which  the  edges  are  ordered  pairs,  while  an  undirected  graph  is  a  graph 
in  which  the  edges  are  pairs  with  no  concern  for  the  order.  Henceforth  in  this  disser- 
tation, the  term  graph  will  always  imply  a  directed  graph. 

In  a  graph  G  ={V,E),  if  {v,w)  is  an  edge,  then  we  say  that  w  is  adjacent  to  v 
and  we  also  say  that  the  edge  {v,w)  is  from  v  to  w.  A  path  is  a  sequence  of  edges 
(^i>^2)'('^2»^3)'  ■  ■  ■  )(^n-i»^n)  usually  represented  by  listing  the  sequence  of 

vertices  in  the  path  {vi,V2,  ■  •  •  ,v„).  The  path  is  from  Vi  to  i;„  and  has  length  n-1. 
If  there  is  a  path  from  v  to  we  may  state  that  w  is  reachable  from  v  and  that  v 
is  an  ancestor  of  w.  Note  that  a  path  with  0  edges  exists  from  any  vertex  v  to 
itself  and  that  v  is  an  ancestor  of  itself.  The  reversal  of  a  path  {vi,V2,  •  •  ■  ,v„), 
denoted  by  {vi,V2,  ■  ■  ■  is  the  path  (v„,u„_i,  •  •  •  A  cycle  is  a  path  in 
which  v„  =i»i. 


g 
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A  flow  graph  G  =  {V,E,s)  is  a  graph  with  a  distinguished  start  vertex  s 
such  that,  for  any  vertex  veV,  there  is  a  path  from  s  to  v.  In  a  flow  graph 
G  ={V,E,s),  a  vertex  v  is  said  to  dominate  a  vertex  w  if  v^w  and  every  path 
from  s  to  w  contains  v.  The  dominator  tree  for  a  flow  graph  is  a  tree  in  which  a 
vertex  v  is  a  proper  ancestor  of  w  if  and  only  if  v  dominates  w  in  the  flow  graph 
[15]. 

The  inverse  of  a  graph  G  =  {V,E)  is  the  graph  G'={V,E')  where 
E'={(v,w)\{w,v)eE}.  Given  two  graphs  Gi  =  {Vi,Ei)  and  G2=(V^2»^2X  is  a 
subgraph  of  if  T^iCFg  and  E1QE2.  If  £^1  =  {{x,y)\{x,y)eE2  and  z.j/eFj}, 
then  we  say  that      is  the  subgraph  of  G2  induced  by  V^. 

A  graph  is  connected  if  for  any  two  vertices  v  and  w,  there  is  a  path  from  v 
to  w  or  a  path  from  w  to  v.  A  graph  is  strongly  connected  if  for  any  two  ver- 
tices V  and  w,  there  is  a  path  from  v  to  w  and  a  path  from  «;  to  v.  The  vertices  of 
any  graph  G={V,E)  can  be  partitioned  into  a  set  of  equivalence  classes  V^,  with 
l<i<k,  such  that  two  vertices  v  and  w  are  equivalent  if  and  only  if  there  is  a  path 
from  v  to  «;  and  a  path  from  w  to  v.  The  graph  G',  =(y,' ,£■,),  where  E^  is  the  set  of 
edges  from  E  that  connect  vertices  of  F,-,  is  called  a  strongly  connected  com- 
ponent (sec)  of  G.  Note  that  a  strongly  connected  component  is  a  maximal  sub- 
graph that  is  strongly  connected. 

The  condensation  of  a  graph  G={V,E)  is  the  graph  G'={V',E\  where  the 
vertices  of  V  are  the  strongly  connected  components  of  G  and  (T^,-,Vy)  is  in  E'  if 
and  only  if  «Vy  and  there  exists  some  v,  eF,-  and  some  vyG  Vy  such  that  {vi,Vj)eE. 
The  condensation  of  a  graph  is  an  acyclic  graph  that  has  an  edge  from  one  com- 
ponent to  another  if  there  is  an  edge  from  a  vertex  of  the  first  component  to  a  vertex 
of  the  second  component  in  the  original  graph. 
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Terms  from  Formal  LanRuage  Theory 

An  alphabet  T  is  a  finite  set  of  symbols.  A  finite  sequence  of  symbols  from  T 
is  called  a  string  .  The  empty  string,  denoted  by  e,  consists  of  zero  symbols. 

The  concatenation  of  two  strings  is  the  string  formed  by  writing  the  second 
string  immediately  following  the  first.  The  concatenation  of  strings  x  and  y  is  writ- 
ten as  xy . 

A  language  is  a  set  of  strings  derived  from  some  alphabet  T.  The  empty  set 
is  the  smallest  language  of  strings  over  T.  The  largest  language  of  strings  over  T  is 
the  set  of  all  strings  over  T,  which  is  denoted  by  T*. 

The  imion  of  two  languages  R  and  S  is  denoted  R+S.  The  concatenation  of 
two  languages  R  and  S,  denoted  RS,  is  the  language  consisting  of  every  possible 
concatenation  of  a  string  of  R  with  a  string  of  5. 

R+S=RUS 

RS  =  {rs\reR  and  s£S} 

For  a  language  L,  we  define  an  infinite  set  of  languages  L'  by  defining  L^={€} 
and  L'=LL'~^.  The  reflexive,  transitive  closure  (under  concatenation)  of  a 
language  L ,  denoted  L  *,  is  the  set 

00 

L*=\JL' 
«-o 

For  an  alphabet  T,  the  regular  expressions  over  T  and  the  languages  that  they 
denote  are  defined  recursively  as 

1)  The  empty  set  is  denoted  by  the  regular  expression  <f>. 

2)  The  set  {e}  is  denoted  by  the  regular  expression  e. 


12 


3)  For  each  r  €  T,  r  is  a  regular  expression  denoting  {r}. 

4)  If  r  and  s  are  regular  expressions  denoting  the  languages  R  and  5,  then 

a)  (r+s)  is  a  regular  expression  denoting  5; 

b)  (rs)  is  a  regular  expression  denoting  RS;  and 

c)  (r*)  is  a  regular  expression  denoting  R*. 

The  regular  expression  definition  uses  parentheses  to  indicate  the  operand(s)  of 
an  operation.  The  usual  precedence  relations  in  which  closure  outranks  concatena- 
tion and  concatenation  outranks  union  will  be  used  to  reduce  the  number  of 
parentheses  in  regular  expressions. 

A  regular  expression  r  is  simple  if  either  r  =<^  or  else  ^  is  not  a  subexpression 
of  r.  It  is  possible  to  transform  any  regular  expression  into  an  equivalent  simple  reg- 
ular expression  by  performing  the  following  steps  until  none  of  them  are  applicable: 

1)  replace  any  subexpression  of  the  form  r(f)  or  (f>r  by  (fr, 

2)  replace  any  subexpression  of  the  form  {r-\^)  by  r;  and 

3)  replace  any  subexpression  of  the  form  ^*  by  e. 

A  regular  expression  r  denoting  the  set  R  is  unambiguous  if  r  represents  each 
string  in  R  imiquely.  This  is  defined  recursively  as  follows: 

1)  The  expressions  </»,  e  and  a  for  each  a  6  T  are  unambiguous. 

2)  If  r  and  s  are  tmambiguous  regular  expressions  denoting  the  languages  R 
and  S,  then 

a)  ( r  -|-s )  is  unambiguous  if  ^  D  5  =  <^ 

b)  (rs)  is  unambiguous  if  for  each  weRS,  there  is  exactly  one  decompo- 
sition w  =  WiW2  where  WjGi?  and  ^2^*^; 

c)  (r*)  is  unambiguous  if  r  j^e  and  if  for  each  w  eR*,  there  is  exactly  one 
decomposition  w  =  WiW2  "  '  '      ^'^^^  for  0<»  <n. 
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A  context-free  grammar  (CFG)  is  a  quadruple  G  ={N,T,P,S),  where  A'^  and 
T  are  finite  disjoint  sets  of  symbols  called  nonterminals  and  terminals  respectively,  S 
is  a  symbol  from  N  called  the  start  symbol,  and  P  is  a  set  of  productions  of  the  form 
A—*-a  where  ae{NUT)*.  In  this  dissertation  all  CFGs  will  be  implicitly  augmented 
by  adding  the  new  symbols  S'  and  _L  and  the  production  S'—*-SA.  forming 
G'={NU{S'},TU{±},PU{S'-*S±},S').  The  symbol  "±"  (pronounced  "chicken 
foot")  is  be  used  to  indicate  the  end  of  the  input. 

An  extended  context-free  grammar  (ECFG)  [16]  is  a  quadruple 
G  ={N,T,P,S),  where  A'^,  T,  and  S  are  the  same  as  for  a  CFG  and  P  is  a  subset  of 
NX{NUT)*.  An  ECFG  production  is  similar  to  a  CFG  production  except  that  the 
right-hand  side  is  a  regular  expression  over  the  alphabet  A'^U  T. 

The  following  (usual)  conventions  will  hold: 

A,B,C,  e  N 

■  ■  ■  ;k,y,z    g  nut 

t,a,b,c,    •  ■  E  T 

•  ■  ■  ,x,y,z  e  T* 

aAl---  e{NUTf 

e  empty  string 

The  productions  of  a  CFG  (or  ECFG)  G  induce  a  natural  relation  on  the  strings 
in  (NUT)*.  We  write  a=^p  (pronounced  "a  derives  /?"")  if  there  exist  symbols  A,  7, 
6,  and  wsuch  that  a=^A6,  ^=yjj6,  and  A—*(jjeP.  The  reflexive,  transitive  closure 
of  "=^"  is  denoted  by  "=**".  A  derivation  of  a  string  a  is  a  sequence  of  strings 
%  ,q;i  ,  •  •  •  ,Q!„ ,  where  Q^=5,  q;„  =a,  and,  for  0<j  <  n,  a,-  =^Q!,+i.  For  the  purposes 
of  this  dissertation,  all  derivations  will  be  right-most;  that  is,  in  every  step  of  a 
derivation  the  right-most  nonterminal  is  replaced  by  the  right-hand  side  of  a  produc- 
tion. 

The  language  defined  by  a  CFG  G,  denoted  L{G),  is  the  set  of  all  strings 
ar  e  T  such  that  5  =4  x.  Each  string  m  L{G)  is  called  a  sentence  of  the  language. 
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For  a  CFG  G  ={N,T,P,S)  and  a  nonterminal  AeN,  we  define  the  grammar 
G'yi=(^U.^x  where  N^={BeN\A=^*aB/3},   Ty^={teT\  A  ^*at/3}  and 

Pj^={B—*-uJ€P  \  B eNj^}.  We  also  refer  to  the  language  defined  by  a  nonterminal 
symbol  A,  denoted  L{A).  We  define  L{S)  to  be  L{G);  while  for  a  nonterminal 
A  7^5,  is  L{Gj^).  Similarly,  we  can  define  to  be  {<}  for  a  terminal  symbol 
t. 

We  extend  the  domain  of  L  to  include  regular  expressions  from  (A^UT)  in  the 
obvious  fashion: 

1)  L{al3)  =  L{a)L{l3). 

2)  L(a+;^=L(a)+L(/?). 

3)  L{a)=L{a)\ 

A  deterministic  finite-state  automaton  (DFA)  is  a  quintuple 
M ={Q,T ,A,qQ,F),  where  Q  is  a  set  of  states,  T  is  a  set  of  symbols,  A:QXT—*^Q 
is  a  partial  function  defining  the  transitions  from  one  state  to  another,  is  the 
start  state  of  M,  and  FQQ  is  the  set  of  final  states. 

A  nondeterministic  finite-state  automaton  (NFA)  is  a  quintuple 
M  =  {Q ,T ,A,qQ,F),  where  Q  is  a  set  of  states,  T  is  a  set  of  symbols, 
Zi:QX(T'U{e})— »-2^  defines  the  transitions  from  one  state  to  a  set  of  states,  is 
the  start  state  of  M,  and  FC.Q  is  the  set  of  final  states.  Note  that  for  an  NFA,  A 
yields  a  set  of  states  (rather  than  a  single  one),  and  that  transitions  on  e  are  possible. 

It  is  well  known  that  the  languages  recognized  by  finite-state  automata  are  reg- 
ular [17].  Fast  algorithms  for  converting  finite-state  automata  to  regular  expressions 
are  described  by  Tarjan  [18].  Tarjan's  algorithms  solve  a  more  general  class  of  prob- 
lems referred  to  as  "path  expression  problems,"  which  will  be  useful  to  us  later. 
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Given  a  graph  G  ={V,E),  any  path  in  G  is  a  sequence  of  edges.  This  sequence 
of  edges  forms  a  string  over  E.  A  path  expression  P  of  type  {v,w)  is  a  simple 
regular  expression  over  E  such  that  every  string  denoted  by  P  is  a  path  from  v  io  w. 

Given  a  graph  G={V,E)  and  a  particular  vertex  s,  the  single- 
source  path  expression  problem  is  the  problem  of  computing,  for  each  vertex 
vEV,  an  imambiguous  path  expression  P{s,v)  such  that  every  path  from  s  to  v  in 
G  is  contained  in  P{s,v)  and  P{s,v)  contains  only  paths  from  s  to  v. 

An  LR(0)  parser  for  a  CFG  G  is  a  quintuple  LRq={G ,Q,qQ,Next, Reduce), 
where  Q  is  a  finite  set  of  parse  states,  "is  the  start  state  of  LRq,  Next  is  the  transi- 
tion function  for  the  characteristic  automaton  CA  ={Q, NUT, Next, qQ,Q),  and 
Reduce '.Q—*^2^  is  the  reduce  function.  LRq  is  a  pvish-down  automaton  which 
operates  by  maintaining  a  stack  of  states  from  Q,  and  performing  moves  based  on 
the  top  stack  symbol  and  (possibly)  the  next  terminal  symbol  from  the  input  string. 

An  LR(0)  item  is  a  pair  (p,i),  where  pGP  is  a  production  and  i  indicates  a 
position  for  a  "dot"  in  the  right-hand  side  of  p.  The  dot  indicates  how  much  of  the 
production  has  been  parsed  and  can  vary  from  zero  to  the  length  of  the  right-hand 
side  of  the  production. 

Using  notation  derived  from  DeRemer  and  Pennello  [9],  the  LR(0)  parser  is 
defined  as  follows  (with  X  =g  F{X)  meaning  that  X  is  the  smallest  set  satisfying 
X  =  F{X)): 

Q  =g  q q\J {Closure {r)\  r  e Successors {q),q  sQ} 
?o  =  Closure{{S'->'.S±}) 
Closure{r)  =,  rU{A-v.w|  BB-^-a.A l3e  Closure (r)} 
Successors{q)  =  {Nucleus {q,x)\  35— ►a.x/?Gg} 
Nucleus{q,x)  =  {B—*-ca.^\B—*'a.x^eq} 
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Next{q,x)  =  Closure  {Nucleus [q  ,x)) 
Reduce  (q)  =  {A— ►wlA— ►w.e?} 

A  conflguration  of  an  LR(0)  parser  is  denoted  by  [oi]z,  where  o;  is  the  parse- 
time  state  stack  and  z  is  the  remaining  input  to  be  parsed.  Since  CA  is  determinis- 
tic, we  take  the  liberty  of  representing  some  states  of  the  parse  stack  by  the  gram- 
mar symbol  labeling  the  transition  to  the  state.  LR(0)  moves,  defined  by  relation 
"I — "  (pronounced  "moves  to"),  are  defined  as  follows: 

shift:  [a]tz  \-  [at]z  ^  Next{Top  [a],t)  is  defined; 

reduce:  [au^z\—[o!A]z  ^  A—*'UJeReduce{Top[au]). 

Note  that  CA  for  G  is  a  DFA;  while  LRq  is  nondeterministic  if  in  any  state 
either  of  the  following  occurs:  1)  a  shift  and  a  reduce  move  both  apply;  2)  more  than 
one  reduce  move  applies.  An  LRq  state  with  more  than  one  possible  move  is  called 
inconsistent. 

A  terminal  string  x  accesses  state  q  if  LRq  can  reach  state  q  from  qQ  by  a 
sequence  of  LRq  moves,  in  which  the  shift  moves  consume  all  the  symbols  of  z.  In 
such  a  case,  x  is  called  an  accessor  of  q . 


CHAPTER  4 
THE  NLR(O)  AUTOMATON 


The  states  of  an  LR(0)  automaton  consist  of  sets  of  items,  and  its  moves  deter- 
ministically  model  sequences  of  transitions  between  items.  It  is  useful  to  consider 
these  transitions  between  items  in  the  context  of  the  LR(0)  automaton.  We  intro- 
duce the  term  NLR(O)  machine  for  the  nondeterministic  machine  derived  from  the 
LR(0)  machine  by  considering  each  (item,state)  pair  as  a  state.  There  are  two  basic 
types  of  transitions  in  the  NLR(O)  machine  which  are  derived  from  LR(0)  state  clo- 
sures and  LR(0)  transitions.  For  an  LR(0)  state  q  with  an  item  A  —*-a.B^,  there  is 
an  e-transition  from  {A—*a.Bl3,q)  to  {B—*.8,q)  for  each  production  B—*-6.  Addition- 
ally, for  each  LR(0)  transition  from  9  to  r  on  symbol  X,  there  will  be  a  transition 
from  {A—*^a.X/3,q)  to  {A—*-aX.0,r)  for  each  appropriate  pair  of  states  from  the 
NLR(O)  machine. 

An  NLR(O)  parser  may  be  constructed  from  an  NLR(O)  automaton  in  much 
the  same  manner  as  an  LR(0)  parser  is  built  using  an  LR(0)  automaton.  An  NLR(O) 
parser  for  a  CFG  G  is  a  quintuple  NLRq^{G ,Q,qQ,Next, Reduce),  where  Q  is  a  fin- 
ite set  of  NLR(O)  parse  states,  qQ  is  the  NLR(O)  start  state.  Next  is  the  NLR(O)  tran- 
sition function,  and  Reduce:  Q—*-2^  is  the  reduce  function.  The  primary  difference 
between  an  NLR(O)  parser  and  an  LR(0)  parser  is  that  the  NLR(O)  automaton  is 
nondeterministic. 

A  conflguration  of  an  NLR(O)  parser  is  denoted  by  [a]z,  where  a  is  the  parse 
stack  consisting  of  NLR(O)  states  and  z  is  the  remaining  input  to  be  parsed.  As  we 
did  with  the  LR(0)  configurations,  we  will  sometimes  represent  states  from  the  parse 
stack  by  grammar  symbols.  However,  since  the  NLR(O)  machine  is  nondeterministic. 
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more  than  one  configuration  might  validly  represent  the  state  of  the  NLR(O)  parser. 
NLR(O)  moves,  denoted  by  relation  "  |—  ",  are  defined  as  follows: 

shift:  [a]tz\-[aq]z  ^  qeNext{Top{a),t); 

£-shift:  [a]2h-[o!9]-2  ^  q  e  Next  {Top  {a),e); 

reduce:  [af^z  \—  [ocq]z  ^  the  path  ^ spells  oo,  A—*-u>eP,  and 
qe  Next  {Top  {a),A). 

A  terminal  string  x  accesses  an  NLR(O)  state  (A— ►a./?, 9)  if  x  accesses  LR(0) 
state  q.  In  that  case  we  say  that  x  is  an  accessor  of  (A— ►a./?, 9). 

For  an  example  of  an  NLR(O)  automaton,  consider  grammar  Gl  in  figure  4-1. 
Its  LR(0)  automaton  is  shown  in  figure  4-2,  and  its  NLR(O)  automaton  is  shown  in 
figure  4-3.  In  figure  4-3  each  NLR(O)  state  is  depicted  with  a  number  from  0  to  29. 
This  number  appears  first  in  each  state  box  and  is  followed  by  the  item  and  the 
LR(0)  state  number. 


Now  consider  parsing  the  string  "aaaJ_"  using  the  LR(0)  and  the  NLR(O)  auto- 
mata for  grammar  Gl.  First  consider  the  LR(0)  parse,  shown  in  figure  4-4.  Two 
reductions  are  possible  in  state  8.  In  this  parse  we  consider  only  the  appropriate 
reduction  without  indicating  how  to  select  it. 


S  -*aB  b  c 
S  — ►aD  a 
S  — ►  b  B  a  c 
S  -►bD  b 


A  — ►  a 
D  -►a 


Figure  4-1.  Grammar  Gl. 
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Figxire  4-2.  LR(0)  automaton  for  grammar  Gl. 


Step 

Stack 

Input 

Action 

1 

0 

aaa_L 

shift  a 

2 

03 

aa_L 

shift  a 

3 

038 

aJ_ 

reduce  D— ^ 

4 

03  9 

a± 

shift  a 

5 

0  3  9  13 

± 

reduce  S— >aDa 

6 

0  1 

± 

shift  ± 

7 

0  14 

accept 

Figure  4-4.  LR(0)  parse  of  "aaaJ.". 


Next  consider  the  NLR(O)  parse  of  the  same  string,  in  figure  4-5.  The  NLR(O) 
parse  of  "aaai."  has  been  simplified  by  showing  only  the  correct  moves.  There  are 
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I  0  S'-..SX,0 

J,   

I  1  S->  bPb.O 

'l  2  S-».Mac,0 

1  3  S-».aDa.O 

1  4  S-».aBbc,0 


*!  g  S-»b.Db,2 

1  7  S->b.Btc,2 

'l  e  D->.t,2 

1  9  B->.A.2 


-^1  11  A^->AX..T1 


-H  la  s-«<)D.b.s 


-»)  18  S->bDb.  .11 


'l  18  S-»bB.ac.7  I  ^20  S-»bB».c.l2|  >t26  S-»bBac..lS| 


1     21  B->A.  ,8  I 


*!  22  !>-»«.  .8  I 
'l    23A->a..8  I 


*1  15  S-Kl.Bbc.3  1 
1     16  B-».A,3  1 

*|     17  D-».».3  1 

-H  24  S->aP.a,8  I  »j  27  S-^a.  .13 


'|2S  S-»aB.bc.lO|  >t28  S-»«Bb.c.l4|  »|29  S-»aBbc..l6| 


Figure  4-3.  NLR(O)  automaton  for  grammar  Gl. 


three  possible  e  transitions  from  state  0,  but  only  one  is  shown.  Having  made  this 
correct  choice,  the  reduction  hy  A — *-a  is  never  considered.  It  is  interesting  to  con- 
trast the  reduce  actions  for  the  two  parsers.  In  the  LR(0)  parser,  the  length  of  the 
right  hand  side  of  the  production  determines  how  many  states  are  popppd  from  the 
stack  prior  to  advancing  on  a  nonterminal  transition.  In  the  NLR(nj  parser,  one 
additional  state  must  be  popped.   This  moves  the  parser  back  to  the  state  from 
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which  the  £- transition  was  made  to  begin  parsing  the  right-hand  side  of  the  produc- 
tion. 


Step 

Stack 

Input 

Action 

1 

0 

aaa_L 

shift  e 

2 

0  3 

aaa_L 

shift  a 

3 

0  3  14 

aa_L 

shift  e 

4 

0  3  14  17 

aa_L 

shift  a 

5 

0  3  14  17  22 

a± 

reduce  D — ►a 

6 

0  3  14  24 

a± 

shift  a 

7 

0  3  14  24  27 

_L 

reduce  S— >aDa 

8 

05 

± 

shift  ± 

9 

0  5  11 

accept 

Figure  4-5.  NLR(O)  parse  of  "aaa±". 


Now  consider  how  one  might  select  the  proper  reduction  in  state  8  of  the  LR(0) 
parser  for  the  string  "aaa_L".  There  are  four  paths  in  the  ?«JLR(0)  machine 
corresponding  to  LR(0)  paths  reaching  state  8.  Consider  first  the  correct  path 
(0,3,14,17,22),  In  making  the  transition  from  state  0  to  3,  the  symbol  "±"  is  tem- 
porarily abandoned  to  be  parsed  later  after  making  a  nonterminal  transition  on  S. 
Likewise,  in  moving  from  state  14  to  17,  the  symbol  "a"  is  left  to  be  parsed  later 
after  a  transition  on  D.  Therefore,  in  step  3  of  the  LR(0)  parse,  the  lookahead  string 
"a±"  is  appropriate  for  the  path  (0,3,14,17,22).  An  examination  of  all  four  paths 
and  their  lookahead  strings  in  given  in  figure  4-6.  From  this,  one  can  observe  that 
the  lookahead  string  "a_L"  is  xmique  and  can  be  used  to  decide  which  reduction  is 
correct  in  step  3  of  the  LR(0)  parse. 
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Path  Lookahead  String;  Action  

(0,3,14,17,22)         a-L  Reduce 
(0,4,15,16,18,23)     bc±  Reduce  A— ^ 

(0,1,6,8,22)  b±  Reduce  D— ^ 
(0,2,7.9,10,23)        ac-L  Reduce  A— »a 

Figure  4-6.  Lookahead  strings  for  paths  leading  to  LR(0)  state  8. 

From  this  example,  one  can  see  that  it  is  possible  to  determine  the  complete 
continuation  languages  associated  with  the  various  paths  leading  to  an  inconsistent 
LR(0)  state  and  to  use  the  continuation  languages  to  determine  the  correct  parsing 
action.  The  example  uses  a  grammar  for  a  finite  language  requiring  two-symbol 
lookahead,  but  we  contend  that  continuation  languages  can  profitably  be  used  to 
achieve  single  symbol,  mviltiple  symbol,  and  arbitrary  lookahead,  all  using  the  same 
theoretical  framework  outlined  in  this  example. 


CHAPTER  5 
DEFINITION  OF  CONTINUATION  LANGUAGES 


The  goal  for  continuation  languages  is  to  use  them  to  perform  lookahead  for 
inconsistent  LR(0)  states.  We  would  like  the  definition  of  continuation  languages  to 
support  this  goal.  We  begin  with  a  simple  definition,  and  then  modify  the  definition 
step-by-step  to  reach  a  definition  that  is  more  amenable  to  computation. 

We  first  note  that  inconsistent  LR(0)  states  exhibit  shift/reduce  conflicts, 
reduce/reduce  conflicts  or  both.  Any  shift  action  from  an  LR(0)  state  q  on  symbol 
X  is  associated  with  the  NLR(O)  states  {A—*-oc.Xf3,q),  for  all  items  A—*-a.X^  con- 
tained in  state  q.  Likewise,  the  reduce  action  for  production  A—*-uj  for  an  LR(0) 
state  q  is  associated  with  the  NLR(O)  state  {A—*-cj.,q).  For  a  pair  of  NLR(O)  states 
p  and  q  representing  conflicting  actions  in  an  LR(0)  state,  our  goal  is  to  distinguish, 
using  some  lookahead  technique,  the  strings  that  can  be  parsed  from  state  p  from 
those  that  can  be  parsed  from  state  q.  We  can  then  vise  this  distinction  to  augment 
the  LR(0)  parser  with  a  lookahead  technique  and  resolve  the  conflict.  Hence,  we 
shall  define  continuation  languages  for  NLR(O)  states. 

Dennition  5-1:  Given  a  grammar  G  and  an  NLR(O)  state  r  ={A—*-a.0,q), 
Continuation {r )  ={zeT*\  3xeT*where  [goja^-^l— *[9o'S'J-]} 

We  can  readily  see  that  there  are  two  parts  to  the  continuation  of  an  NLR(O) 
state  r  ={A—*'a.^,q).  Clearly,  the  first  part  of  the  continuation  is  /?,  which  is  the 
"tail"  of  the  item  in  state  r.  We  shall  also  see  that  the  second  part  of  the  continua- 
tion is  a  "backlog"  of  partially  parsed  productions  from  the  stack.  This  is  illustrated 
in  figure  5-1,  which  displays  several  states  of  an  NLR(O)  machine. 
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.71.  ^B->'5A^.,q2 


A—*-.a^,qQ 

 ^ 

A— *■(>:. /3,q 

A—*'a^.,q3 

Figure  5-1:  Illustration  of  Tail  and  Backlog. 


After  reaching  state  (A— 9),  13  must  be  parsed  next.  After  recognizing 
the  NLR(O)  parser  would  perform  a  reduction.  The  parser  would  pop  states  from  the 
state  stack  until  it  reached  {B—*'6.A%qQ).  Then  it  would  move  forward  on  A  to 
(B—^SA.^qi).  From  this  point,  the  parser  would  parse  7,  the  tail  of  that  item. 
Clearly,  the  parsing  of  7  had  been  postponed  earlier  when  the  e-transition  was  made 
from  {B—*-6.A%qQ)  to  (A—^.a^jqQ).  Similarly,  after  recognizing  -7,  the  parser  might 
resume  parsing  the  tail  of  some  other  third  item.  This  process  could  continue  an 
arbitrary  number  of  times,  parsing  a  backlog  of  tails  of  immediate  successors  of  cer- 
tain states  from  the  parse  stack.  We  can  now  define  Continuation  in  these  new 
terms: 

DeHnition  5-2:  Given  a  grammar  G  and  an  NLR(O)  state  r  ={A—*a.l3,q), 
Continuation (r)  =  Tail{r)  Backlog{r),  where 

Definition  5-3: 

Tail{r)  =  Tail{A^a.l3)  =  and 

Definition  5-4: 

Backlog{r)={zeT*\  3a;,y  e T*  where  [go]^3/'^l~*[Q;^]j/-2|— *[9o'5-L]} 

It  is  quite  simple  to  compute  Tail  for  an  NLR(O)  state.  Unfortunately,  our 
definition  of  Backlog  is  not  quite  so  useful.  We  shall  modify  the  definition  of  Back- 
log imtil  we  reach  a  more  algorithmic  definition.  In  the  process,  we  will  find  it 
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convenient  to  extend  the  domain  of  Backlog  to  include  NLR(O)  paths  and  transitions 
between  NLR(O)  states.  Ultimately,  Backlog  for  an  NLR(O)  state  will  be  computed 
based  on  Backlogs  for  transitions. 

Of  particular  importance  in  the  definition  of  Backlog  is  the  fact  that  x  accesses 
r .  Naturally,  there  can  be  many  such  strings  x  and  each  x  will  follow  some  path  to 
r.  We  can  rephrase  this  definition  in  terms  of  these  paths: 

Definition  5-5:  Given  a  grammar  G  and  an  NLR(O)  state  r  =(A— ►q;./?,^), 

Backlog{r)  =  \JBacklog{7r) 

where  tt  is  any  path  from  the  NLR(O)  start  state  to  r,  and 
Deflnition  5-6: 

Backlog [n)  ={z£  T*\  3x,y  G  T*  where  [^ol^J/'S^  \—  *        |—  *  [7r/5]2 1—  *  [^o^-L]} 

Recall  that  every  transition  is  either  an  €-transition,  or  a  transition  on  a  gram- 
mar symbol  (denoted  as  an  X-transition).^  Furthermore,  it  is  clear  that  e-transitions 
arise  during  computation  of  the  closure  of  an  LR(0)  state,  while  X-transitions  are 
derived  from  LR(0)  transitions. 

Now,  consider  the  contribution  of  an  e-transition  to  the  backlog.  In  figure  5-2 
there  is  an  e-transition  from  (A— ►q;.5/?,9o)  to  (^^.^^o)-  After  making  this  transi- 
tion, the  NLR(O)  parser  mvst  recognize  u!  and  perform  a  reduction  from  state 
{B—^oj.jq^).  In  making  this  reduction,  the  parser  moves  forward  on  B  to 
[A — >^ctB./3,qi).  From  this  state,  /?  must  be  parsed  next.  We  see  then,  that  the  pars- 

^  Pager  refers  to  the  target  of  an  f-transition  as  an  "immediate  successor"  of  the  source 
state  [6].  For  the  target  of  a  grammar  symbol  transition,  he  uses  the  term  "transition 
successor." 
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ing  of  /?  was  postponed  when  making  the  e-transition.  In  general,  an  e-transition 
from  {A—*^a.BP,q)  will  contribute  P to  the  backlog. 


— ^ 

A—*-aB.P,qi 

B-*'.UJ,qo 

B^UJ.,q2 

Figure  5-2:  Contribution  of  an  e-transition  to  the  backlog. 

On  the  other  hand,  an  X-transition  represents  "shifting  the  dot"  in  a  given  pro- 
duction. This  is  illustrated  in  figure  5-3.  When  the  NLR(O)  parser  makes  the  move 
on  X  to  (A— ►qX./?,*/!),  the  change  in  the  continuation  consists  of  the  removal  of  X. 
This  is  embodied  in  the  tails  of  the  two  states.  The  tail  of  the  first  state  is  Xj3,  while 
the  tail  of  the  second  is  simply  Clearly,  nothing  is  removed  or  added  to  the  back- 
log by  an  X-transition. 


4-^q;.X/?,<?o 

— ^ 

4— ►qX./?,^! 

Figure  5-3:  Contribution  of  an  X-transition  to  the  backlog. 
These  two  observations  yield  the  definition  of  the  backlog  for  an  NLR(O)  transi- 


tion: 

Deflnition  5-7: 


Backlog  {{A -*-a.X/3,p),{B->'8.^,q))  =  ' 


P  if  5=6 
e  otherwise 


Having  defined  the  backlogs  for  the  two  types  of  NLR(O)  transitions,  it  is  a  sim- 
ple matter  to  derive  the  backlog  of  an  NLR(O)  path: 
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Theorem  5-1:  For  an  NLR(O)  path  7r=(7ro,7ri,  •  •  •  ,7r„)  where  ttq  is  the  NLR(O) 
start  state,  Backlog {7r)= Backlog {7r„_i,'Kf^)Backlog{'K^_2,'7r„^i)  •  •  ■  Backlog {7rQ,n^). 

Proof:  Without  loss  of  generality,  assume  that  n  >  0.  Then  the  base  case  for  induc- 
tion is  when  n=l.  Clearly,  when  n=l  the  path  is  exactly  1  transition.  Therefore, 
by  the  definition  for  the  backlog  for  an  NLR(O)  transition, 
Backlog  (tt) = Backlog  (tTqjTTi). 

Now,  for  the  inductive  step,  assume  that  the  theorem  is  true  for  all  k<n. 
Then,  considering  the  path  7r=(7ro,7ri,    •  ■  ,7r„)  where  n  >  1, 

Backlog  {{nQfTT^,  •  •  ■  ,7r„_i))=5acA;/o</(7r„_2,7r„_i)  •  •  •  Backlog  {nQjTTi). 

The  transition  from  7r„_j  to  7r„  is  either  i)  an  e-transition  or  ii)  an  X-transition. 

Case  i:  The  transition  from  T^n-i  to  7r„  is  an  e-transition. 

Since  it  is  an  e-transition,  then  we  may  express  7r„_i  and  7r„  as  {A—*-a.B^,q) 
and  {B—*^.uj,q)  respectively.  In  making  this  transition,  /?  is  deferred  to  be  parsed 
immediately  after  uj  is  recognized.  Clearly  /?  must  be  parsed  prior  to 
Backlog  {{tTqjTTi,  •  •  •  ,7r„_J).  Therefore 

Backlog {7r)=^ Backlog {{ttqjTTi,  •  •  •  ,7r„_i)). 

This  can  be  combined  with  the  expression  for  Backlog  {{nQfTTi,  ■  •  ■  ,7r„_J)  from  the 
inductive  hypothesis  yielding 

Backlog {7r)=^ Backlog {ir^_2,'^n-i)  '  '  '  Backlog {TrQ,^^). 
Furthermore,  since  Backlog{{A — *^a.B/3,q),{B — *-.0J,q))  =  /3,  we  have 

Backlog{'K)=Backlog['K^_]^,'K^)  Backlog {'K^_2,T^n-^  ■  ■  ■  Backlog {tVq,'Ki). 

Case  ii:  The  transition  from  tt^-i  to  7r„  is  an  X-transition. 
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Then  we  may  express  7r„_i  and  7r„  as  (A— ►a.X/?,9o)  (^~*ctX"'A9i)- 
making  this  transition,  the  NLR(O)  parser  has  not  deferred  the  parsing  of  anything. 
Hence,  nothing  is  added  to  the  backlog.  In  this  case 

Backlog Backlog {{tTqjTCi,  ■  ■  •  ,7r„_i)) 

and,  since  Backlog[[A—*'a.X^,qQ^,[A—*-oiX.^,q]))=t,  we  have 

Backlog{'K)=Backlog{'Kf^_y,'K^)  Backlog  ['Kj^_2,'Kn-\)  '''  Backlog  {tTq,!^^. 

For  an  example  consider  the  NLR(O)  machine  for  grammar  Gl  presented  in  fig- 
ure 4-3.  As  shown  earlier  there  is  an  LR(0)  conflict  depicted  by  NLR(O)  states  22 
and  23.  There  are  two  paths  leading  to  state  22:  (0,3,14,17,22)  and  (0,1,6,8,22). 
Using  definition  5-2,  we  get  Continuation{22)=  Tail {22)Backlog  {22).  Since  state  22 
is  (Z>— »-a.,8),  Tail{22)=e.  Therefore  Continuation{22)= Backlog (22).  Using  defini- 
tion &-5,  we  obtain  Continuation  (22)  =  fiac^/o^  (0,3,14,17,22)  U  5acA;/o^  (0,1,6,8,22). 
If  we  apply  theorem  5-1,  we  see  that  5acA;/ogf  (0,3,14,17,22)  =  Backlog  {17 ,22)  Back- 
log {14,17)  Backlog  {3,14)  Backlog  {0,3).  Then  we  can  apply  definition  5-7  to  obtain 
jBacA;/o^  (0,3,14,17,22) =eae±  =  a  _L.  Similarly  we  can  determine  that 
5acA;/o^  (0,1,6,8,22)  =  6  ±.  These  two  strings  can  be  combined  to  obtain  Continua- 
tion{22)  =  a±  +  6_L.  Likewise,  we  can  derive  that  Continuation {23)  =  acA.  + 
bc±. 

The  preceding  theorem  and  definitions  provide  a  technique  for  computing  con- 
tinuation languages  based  on  paths  in  the  NLR(O)  machine.  Of  course,  in  the 
NLR(O)  machines  for  most  interesting  grammars  there  will  be  states  with  an  infinite 
number  of  paths  leading  to  them.  We  shall  show  that  the  path  expression  algorithms 
defined  by  Tarjan  [18]  can  be  used  to  compute  continuation  languages  for  any  state 
of  an  NLR(O)  machine. 


CHAPTER  6 
COMPUTING  CONTINUATION  LANGUAGES 


We  have  shown  that  computing  the  continuation  language  for  an  NLR(O)  state 
consists  of  computing  Tail  and  Backlog  of  the  state.  The  Tail  computation  for  an 
NLR(O)  state  is  trivial  since  it  is  simply  the  portion  of  the  production  following  the 
dot  in  the  item.  In  contrast,  the  Backlog  computation  for  an  NLR(O)  state  is  some- 
what complex.  Backlog  for  an  NLR(O)  state  is  defined  as  the  union  of  the  Backlogs 
for  all  paths  in  the  NLR(O)  machine  leading  to  the  state.  If  the  NLR(O)  machine  for 
a  grammar  contains  any  cycles,  then  there  are  an  infinite  number  of  possible  paths 
for  at  least  one  state.  Therefore,  it  is  convenient  to  use  regular  expressions  to 
express  cyclic  contributions  to  Backlog . 


Tarjan's  Path  Expression  Algorithm 


An  efficient  algorithm  is  presented  by  Tarjan  [18]  for  solving  path  problems  for 
a  graph.  The  basic  path  problem  addressed  by  Tarjan  is  the  single-source  path 
expression  problem,  in  which,  for  a  flow  graph  G,  a  regular  expression  P{s,v)  must 
be  computed  for  each  vertex  v  in  G  describing  all  paths  in  G  from  the  source  vertex 
s  to  V.  Tarjan's  algorithm  can  be  adapted  to  compute  homomorphic  images  of 
paths,  and,  if  applied  to  the  graph  derived  from  the  NLR(O)  automaton,  it  can  be 
used  to  compute  Backlog  for  NLR(O)  states. 

Tarjan  introduces  the  the  concept  of  a  path  sequence  for  a  graph  G={V,E). 

Definition  6-1:  A  path  sequence  [18]  is  a  sequence  {Pi,Vi,Wi),  (i'2»'^2>^2)>  "  '  " 
{Pi,vi,wi)  such  that 
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1)  for  l<i</,  P,- is  an  unambiguous  path  expression  of  type  («,-,«;,•); 

2)  for  l<i<l,  if  Vi  =  Wi,  then  cgP,-;  and 

3)  for  an  nonempty  path  tt  in  G,  there  exists  a  unique  sequence 
l<h<*2<  ■  ■  ■  <*t<^  ^'^^  ^  unique  partition  of  tt  into  nonempty  paths 
7r=7ri7r2  •  •  •  tt^^  where  7ryeP,v  for  l<y<Ar. 

For  an  example  of  a  path  sequence  consider  the  graph  in  figure  6-1.  In  this 
graph,  each  edge  is  labeled  uniquely  with  a  single  letter  to  simplify  the  writing  of 
paths.  Each  vertex  is  numbered  and  vertex  0  is  the  start  vertex  for  the  graph. 


Figure  6-1.  Graph  to  illustrate  Eliminate  and  Solve. 


One  path  sequence  for  the  graph  in  figure  6.1  is 


(  a,  0,  1  ) 
(  b,  0,  3  ) 
(  c,  1,  4  ) 
(  dc,  2,  4  ) 
(  (eb)*,  3,  3  ) 
(  f,  3,  4  ) 
(  (gdc)*,  4,  4  ) 
(  g,  4,  2  ) 
(  e,  3,  0  ) 
(  ea,  3,  1  ) 
(  d,  2,  1  ) 


Now  consider  the  string  bebebebfg  which  represents  one  path  from  vertex  0  to 
vertex  2.  We  can  construct  one  and  only  one  path  expression  of  type  (0,2)  which 
includes  bebebebfg  by  concatenating  selected  expressions  from  the  path  sequence  in 
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order;  namely  b{eb)*fg.  Furthermore  this  path  expression  is  unambiguous,  which 
means  that  bebebebfg  matches  b{eb)*fg  in  exactly  one  way. 

The  salient  characteristic  of  a  path  sequence  is  that  any  path  tt  is  a  member  of 
the  concatenation  of  path  expressions  of  some  subsequence  of  that  path  sequence. 
This  feature  makes  solving  the  single-source  path  expression  problem  easy  given  a 
path  sequence  using  the  following  code  for  Solve  [18].  Also  listed  are  the  functions 
Cat,  Or  and  Star  which  are  used  to  produce  simple  regular  expressions  that  do  not 
contain  subexpressions  of  the  form  xe,  ex  or  e*. 


procedure  Solve; 
P{s,s)  =  e; 
for  each  v  G  V'— {«} 

P{s,v)  =  <P; 
end; 

for  1=1  to  / 
if  V,-  =  «;,•  then 

P{s,Vi)=Cat{P{s,Vi),Piy, 
else 

P{s,Wi)=Or{Pis,Wi),Cat{P{s,Vi),Pi)); 
endif; 
end; 
end; 

function  Cat{x,y) 
if  x=(j)or  y  =<f)  then 

return  <^ 
else  if  a:  =e  then 

return  y; 
else  if  1/  =  e  then 

return  x; 
else 

return  xy ; 
endif; 
end; 
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function  Or{x,y) 
if  x=(f)  then 
return  y; 
else  if  3/  =  0  then 

return  x; 
else 

return  x+y; 
endif; 
end; 

fimction  Star{x) 

if  I  =  e  or  X  =  0  then 

return  e; 
else  ^ 
return  x  ; 
endif; 
end; 

Tarjan  presents  a  second  procedure  called  Eliminate  to  compute  a  path 
sequence  for  a  graph  G={V,E),  where  V={vq,Vi,  ■  ■  •  v„_i}.  This  procedure  uses  a 
two-dimensional  array  P  to  accumulate  path  expressions  with  special  properties 
between  pairs  of  vertices.  Eliminate  does  not  compute  all  possible  path  expressions 
between  all  pairs  of  vertices,  and,  hence  P  can  be  represented  using  sparse  matrix 
techniques. 


procedure  Eliminate ; 
for  i=0  to  n— 1 
for  j  =0  to  n— 1 

end; 
end; 

for  each  e  =(v,-,v^)6£^ 

P{i,j)=Or{P{i,j),ey, 
end; 

for  i=0  to  n— 1 

P(t-,t)  =  5<ar(P(t-,i)); 
for  each  j  >i  where  P{j,i)j^(f) 
P{j,i)  =  Cat{P{j,i),P{i,i)); 
for  each  k  >i  where  P{i,k)^(f) 

P{j,k)=0r{P[3,k\Cat{P{j,i),P{i,k))); 
end; 
end; 
end; 
end; 
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After  executing  Eliminate,  the  path  expressions  P{u,w)  can  be  used  to  form  a 
path  sequence  for  G.  The  elements  of  {{P {u, w), u, w)\ P{u,w)^{e,(l)}  a,nd  u  <w} 
listed  in  increasing  order  on  u,  followed  by  the  elements  of 
{{P{u  ,w),u  ,w)\  P{u  ,w) 7^ <f>  a.nd  u  >  w}  listed  in  decreasing  order  on  u  form  a  path 
sequence  for  G. 

For  an  application  of  Eliminate  and  Solve  consider  again  the  graph  in  figure  6- 
1.  We  begin  with  procedure  Eliminate .  After  the  pair  of  first  nested  loops  executes, 
the  array  P  is  filled  with  (f).  Following  this,  each  edge  of  the  graph  is  placed  in  P. 
The  resulting  array  is  shown  in  table  6-1.  Entries  in  the  table  which  are  blank 
represent  (p. 

Table  6-1.  The  array  P  after  the  first  pair  of  loops  in  Eliminate. 


0 

1 

2 

3 

4 

0 

a 

b 

1 

c 

2 

d 

3 

e 

f 

4 

After  entering  the  edges  into  P,  Eliminate  executes  the  triply- nested  loop  which 
combines  edges  to  form  path  expressions  between  selected  pairs  of  vertices.  The 
array  P  after  the  execution  of  Eliminate  is  shown  in  table  6-2. 
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Table  6-2.  The  array  P  after  executing  Eliminate. 


0 

1 

2 

3 

4 

0 

e 

a 

b 

1 

e 

c 

2 

d 

e 

dc 

3 

e 

ea 

feb)* 

f 

4 

fsdc)* 

From  the  array  P  we  obtain  a  path  sequence  for  the  graph  by  listing  the  ele- 
ments of  {{P{u,w),u,w)\  P{u,w)i^{e,(f)}  and  u<w}  in  increasing  order  on  u,  followed 
by  the  elements  of  {{P{u,w),u,w)\  P{u,w)^(f)  and  u  >w}  in  decreasing  order  on  u: 


(  a,  0,  1  ) 
(  b,  0,  3  ) 
(c,  1,4) 
(  dc,  2,  4  ) 
(  (eb)*,  3,  3  ) 
(    3,  4  ) 
(  (gdc)*,  4,  4  ) 
(  g,  4,  2  ) 
(  e,  3,  0  ) 
(  ea,  3,  1  ) 
(  d,  2,  1  ) 


This  path  sequence  is  then  processed  by  Solve  to  produce  path  expressions  from 
vertex  0  to  each  vertex  of  the  graph.  When  Solve  completes  we  have  the  following 
path  expressions: 


P(0,0)     =  b(eb)*e+e  =  (be)* 

P(0,1)      =  a4-b(eb)*ea-Kac-|-b(eb)*f)(gdc)*dg 

=  (be)*a-Kac-hb(eb)*f)(gdc)  dg 

P(0,2)      =  (ac-hb(eb)*f)(gdc)*g 

P(0,3)     =  b{eb)* 

P(0,4)     =  {ac-hb(eb)*f)(gdc)* 
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An  Algorithm  to  Compute  Backlog 

The  procedures  Solve  and  Eliminate  compute  path  expressions  for  a  graph,  but 
we  need  to  compute  Backlog  for  NLR(O)  states.  Backlog  for  a  state  is  the 
homomorphic  image  of  the  reversal  of  a  path  expression  denoting  ail  paths  from  the 
start  state.  We  can  modify  these  procedures  to  compute  homomorphic  images  of 
path  reversals  by  adding  assignment  statements  for  an  array  H  that  will  maintain 
homomorphic  images  of  reversals  of  corresponding  entries  in  P. 

We  extend  the  concept  of  a  path  sequence  to  include  these  homomorphic 
images. 

Definition  6-2:   A  homomorphism  sequence  for  a  graph  G  =  {V,E)  and  a 
homomorphism    H:E—*^T*    is    a   sequence    {Hi,Pi,Vi,Wi),    (ff2»^2>^2'^2)>  '  "  ' 
{Hi,Pi,Vi,wi)  such  that 

1)  {Pi,Vi,Wi),{P2,V2,u>2),  '  ■  ■  {Pi,vi,u)i)  is  a  path  sequence  for  G. 

2)  For  l<i< I,  Hi  =H{Pi'). 

Given  a  graph  G,  a  homomorphism  H  and  a  homomorphism  sequence 
{Hi,Pi,Vi,Wi),  (/^2>-f*2»^2»'^2)>  ■  ■  ■  mi,Pi,Vi,Wi)  we  can  adapt  Solve  to  compute  the 
reverse  homomorphic  images  of  the  path  expressions  computed  by  Solve .  This  new 
procedure,  Solve',  is  listed  below. 
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procedure  Solve'; 
H{s,s)=P{s,sUe' 
for  each  v  G  V — [s] 

H{s,v)=P{s,v)=<jr, 
end; 

for  t  =  1  to  / 
if  Vi  =  w,-  then 

P[s,Vi)  =  Cat[P{s,Vi),Pi); 
H{s,Vi)  =  Cat[Hi,H{s,Vi)); 

6ls6 

P[s,Wi)=Or{P{s,Wi),Cat{P{s,Vi),Pi)); 
H{s,Wi)  =  Or{H{s,Wi),Cat[Hi,H{s,Vi))); 
endif; 
end; 
end; 

Lemma  6-1:  Let  G={V,E)  be  a  graph  and  let  H:E—*'T*  be  a  homomorphism  of 
the  edges  of  G.  Let  {H^,P^,v^,'Wi\,{H2,P2^'^2^'^2)^  '  '  '  mi,Pi,Vi,Wi)  be  a  homomor- 
phism sequence  for  G  and  H.  After  k  iterations  of  the  loop  on  i  in  Solve', 
H{s,v)=H{P{s,v)')ioT  2\\v^V. 

Proof:  First,  let  k=0.  Then  H{s,s)=P{s,s)=e=H{e)=H{P[s,s)').  For  all  other 
vertices  v,  H{s,v)=P{s,v)  =  <f)=H{(f>)H{P{s,v)'). 

Now,  assume  that  the  lemma  is  true  for  k—1  and  consider  what  happens  during 
the  A;'th  iteration.  There  are  two  cases:  i)  Vj^  =Wff  and  ii)  v^t  ^^w^. 

case  \)  Vff=W)i. 

Here,  H{s,Vic)  will  be  replaced  with  HiiH{s,Vii).  We  know  that  for  iteration 
k—1,  H{s,Vif)=H{P{s,vif)')  and  we  know  that  Hif=H(Pf^').  Since  H  is  a  homomor- 
phism, H{s,v,,)  will  be  replaced  with  H{P{s,VkY)H{Pk')=H{{P{s,vi,)Pi,y).  In  the 
same  iteration  of  the  loop,  P{s,Vj()  is  replaced  with  P(s,V;t)-^*'  Thus  after  iteration 
k,H{s,v,,)=H{P{s,Vk)'). 

case  ii)     ^  . 

Here  H{s,'Wi^)  will  be  replaced  with  H{s ,Wif)+HiiH{s ,vii).  We  know  that  for 
iteration  ^—1,  i^(s,Wjt)=/f(P(s,«;^)')  and  H[s ,Vii)=H{P{s ,Vj^)').  We  also  know  that 
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Hi=H{Pk).  Therefore  H{s,Wk)  will  be  replaced  with  H{P{s,Wk)')  +  H{Pk) 
H{P{s,Vky).  Since  H  is  a  homomorphism,  H{s,Wk)  is  replaced  with  H{{P{s,Wi^)  + 
P{s,Vif)Piiy).  In  the  same  iteration  of  the  loop,  F(s,«;^)  will  be  replaced  with 
P{s,wic)  +  P{s,vii)Pii.  Thus  after  iteration  k,  H{s,wif)  =  H{P{s,wiiy). 

Theorem  6-1:  Let  G  ={V,E)  be  a  graph  and  let  H:E—*-T*  be  a  homomorphism  of 
the  edges  of  G.  Let  {Hi,Pi,Vi,Wi),{H2,P2,V2,yJ2)>  '  '  '  {HhPi,Vi,Wi)  be  a  homomor- 
phism sequence  for  G  and  H.  After  executing  Solve',  H{s,v)=H{P{s,vy)  for  all 
v€V. 

Proof:  The  previous  lemma  applied  for  k=l  proves  the  theorem. 

Theorem  6-1  shows  that  Solve  can  be  extended  to  compute  homomorphic 
images  of  path  reversals  given  a  homomorphism  sequence.  We  will  show  that  the 
adaptation  of  Eliminate  shown  below  can  compute  a  homomorphism  sequence. 


procedure  Eliminate '; 
for  I  =0  to  n— 1 
for  j  =0  to  n— 1 

H{i,j)=P{i,j)  =  <Pi 
end; 
end; 

for  each  e  =  {v,,Vj)eE 

P(i,jUOr(P{i,j\,ey, 

H{i,j)=Or\H{i,j),H{e)y, 
end; 

for  » =0  to  n— 1 

P(i,i]=Star(P(i,j\y, 
H\i,i)  =  Star[H\i,j)y, 

for  each  j  >  i  where  P{j,i)^<f)  (or  H{j,i)^ 
P(j/)=Cat(P{j,i),P(i,i)y, 


end; 
end; 
end; 
end; 


Hlj,i)=CatlH{i,i)MjJ)y, 
for  each  k  >  i  where  P{i,k)r^<f)  (or  H[i,k)ji(j>) 

P(j,kUOr(P(j,k\,Cat{P{j,iyP{i,km 

H{j,k)=Or\H{j,k),Cat{H{i,kyHU,t))y, 


38 


Theorem  6-2:  Let  G={V,E)  be  a  graph,  let  G"=(F,£")  be  the  inverse  of  G  and 
let  H:E'—*-T*  be  a  homomorphism  of  the  edges  of  G\  Let 
(//^i,Pi,Vi,Wi),(/f2>^2'^2>^2)»  ■  ■  ■  i^hPhVijiVi)  be  a  homomorphism  sequence  for  G 
and  /f.  After  executing  Eliminate',  H{s,v)=H{P{s,vy)  for  all  veV. 

Proof:  In  Eliminate',  each  assignment  to  an  element  of  P  is  followed  by  an 
appropriate  assignment  to  the  corresponding  element  of  H. 

In  the  first  pair  of  nested  loops,  H{i,j)  and  P{i,j)  are  both  assigned  the  value 
<f).  After  the  loops  complete,  H{i,j)  =  <f>=H{P{i,j)')  for  all  pairs  of  i  and  j. 

In  the  loop  on  c,  P{i,j)  is  replaced  with  Or{P{i,j),e).  Then  H{i,j)  is  replaced 
with  Or{H{i,j),H{e')).  Immediately  prior  to  these  assignments,  H{i,j)=H{P{i,j)'). 
Therefore,  H{i,j)  is  replaced  with  Or{H{P{i,j)'),H{e'))  which  is  the  same  as 
H{Or{P{i,j),e)')  from  the  previous  iteration  of  the  loop.  After  this  loop  completes, 
H{i,j)  =  ^=H[P{i,j)')  for  all  pairs  of  i  and  j. 

Similar  arguments  can  be  stated  for  each  of  the  three  pairs  of  assignments  to  H 
and  P  in  the  nested  loops  on  i,  j,  and  k.  The  essential  property  is  that  each  assign- 
ment to  an  element  of  P  is  followed  by  a  homomorphism-preserving  assignment  to 
an  element  of  H.  After  executing  Eliminate',  H{s,v)=H{P{s,vy)  for  all  v£V. 

It  should  be  evident  that  the  calculations  for  H{s,v)  in  procedures  Solve'  and 
Eliminate'  are  independent  of  the  calculations  for  P{s,v).  This  means  that  we  can 
remove  those  calculations  and  calculate  homomorphic  images  of  path  expressions 
without  computing  the  path  expressions. 

For  an  application  of  Eliminate'  and  Solve'  consider  the  graph  in  figure  6-2. 
This  is  the  graph  derived  from  a  hypothetical  NLR(O)  machine.  To  simplify  the 
analysis,  the  vertices  are  numbered  with  vertex  0  being  the  start  vertex  for  the 
graph,  and  each  edge  is  labeled  with  Backlog  of  that  edge. 
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Figure  6-2.  Graph  to  illtistrate  Eliminate '  and  Solvt 

We  begin  with  procedure  Eliminate'.  However,  we  do  not  show  the  contents  of 
P,  since  the  computation  of  H  is  not  dependent  on  P.  After  the  first  loop  executes, 
the  array  H  is  filled  with  (f>.  Following  this,  Backlog  of  each  edge  of  the  graph  is 
placed  in  H.  The  resulting  array  is  shown  in  table  6-3.  Entries  in  the  table  which 
are  blank  represent  <p. 

Table  6-3.  The  array  H  after  the  first  pair  of  loops  in  Eliminate 


0 

1 

2 

3 

4 

0 

_L 

± 

1 

de 

e 

2 

3 

Ab 

Ab 

4 

c 

c 

After  entering  the  edges  into  H,  Eliminate'  executes  the  triply-nested  loop 
which  combines  elements  of  H  to  form  path  expressions  between  selected  pairs  of 
vertices.  The  array  H.  after  the  execution  of  Eliminate '  is  shown  in  table  6-4. 
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Table  6-4.  The  array  H  after  executing  Eliminate'. 


0 

1 

2 

3 

4 

0 

e 

± 

± 

1 

e 

de 

e 

2 

e 

3 

(Abl* 

Ab 

4 

c 

fAbl*c 

(AbfAbfcl* 

The  definition  of  a  homomorphism  sequence  requires  a  path  sequence.  How- 
ever, since  we  are  not  computing  P,  we  will  list  the  homomorphism  sequence 
without  the  P,-  entries.  From  the  array  H  we  obtain  a  path  sequence  for  the  graph 
by  listing  the  elements  of  {^H{u ,w),u ,w)\  either  H[u,w)j^(j)  and  u<w  or 
H{u ,w)'fi{e,^  and  u=u;}  in  increasing  order  on  u,  followed  by  the  elements  of 
^H{u,w),u,w)\  H{u,w)9^(l)  and  u  >  «;}  in  decreasing  order  on  u: 


(  -L,  0,  1  ) 

(  -L,  0,  3  ) 

(  de,  1,  2  ) 

(e,  1,4) 

(  (Ab)*,  3,  3  ) 

(  Ab,  3,  4  ) 

(  (Ab(Ab)*c)*,  4,  4  ) 

(c,  4,  2) 

( (Ab)*c,  4,  3  ) 


This  sequence  is  then  processed  by  Solve '  to  produce  Backlog  for  each  original 
NLR(O)  state.  When  Solve '  completes  we  have  the  following  expressions: 


H(0,0)  =  e 

H(0,1)  =  ± 

H(0,2)  =  de±-k(Ab(Ab)*c)*(Ab)*± 

H(0,3)  =  (Ab)*±-KAb)*c(Ab(Ab)*c)*(Ab)*± 

H(0,4)  =  (Ab(Ab)*c)*(Ab)*_L 
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Procedures  Solve'  and  Eliminate'  employed  with  H = Backlog  provide  an  effi- 
cient technique  for  computing  Backlog  for  NLR(O)  states.  Concatenating  Tail{q) 
and  Backlog [q)  for  an  NLR(O)  state  q,  we  have  an  algorithm  to  compute 
Continuation  [q). 

Characterization  of  Continuation  Languages 

The  algorithm  for  computing  continuation  languages  computes  them  as  regular 
expressions  of  terminals  and  nonterminals.  In  a  continuation  language  a,  each  non- 
terminal A  represents  L{A).  Thus  A  represents  a  context-free  language.  Trivially 
each  terminal  symbol  t  in  a  also  is  a  context-free  language.  Since  context-free 
languages  are  closed  under  the  three  regular  expression  operators  [14],  we  have  the 
following  theorem: 

Theorem  6-3:  For  an  NLR(O)  state  q,  Continuation (q)  is  a  context-free  language. 

For  a  continuation  language  a,  we  may  construct  an  extended  context  free 
grammar  for  a  by  creating  a  new  start  symbol  R  and  starting  a  set  of  productions  P 
with  R—^-a.  Then  we  must  continue  to  add  to  P  any  new  productions  of  the  form 
A—*-l3  for  any  A  appearing  on  the  right  hand  side  of  a  production  in  P.  When  this 
process  finishes  we  have  an  ECFG  for  a.  This  ECFG  can  then  be  converted  into  a 
CFG  by  repetitively  applying  two  grammar  transformations  [16]. 

Having  characterized  continuation  languages  as  being  context-free  languages,  we 
now  consider  whether  they  are  some  special  subset  of  the  context-free  languages.  We 
can  show  with  an  example  that  they  can  be  any  arbitrary  context-free  language. 

Let  A  be  the  start  symbol  of  an  arbitrary  context-free  language  with  grammar 
^A={^Ay'^A^PAj^)'    Then  construct  a  new  grammar  G    =   (  N^U{S,B,C}, 


42 


r^U{6},  P\J{S^Cb,S-»-BA,C-*b,B^b},  S  )  by  adding  new  symbols  B,  C,  S, 
and  b .  The  LR(0)  machine  for  G  will  have  the  general  form  depicted  in  figure  6-4. 


Figure  6-4.  LR(0)  automaton  for  G. 

In  the  LR(0)  machine  for  G,  we  see  that  there  is  a  reduce/reduce  conflict  in 
state  7.  It  is  easy  to  see  that  Continuation{B—*-b A.  Therefore,  we  can  con- 
struct a  grammar  with  an  NLR(O)  state  having  any  context-free  language  for  its  con- 
tinuation. 

Simplifying  Regular  Expressions 

The  continuation  languages  computed  as  described  previously  are  regular 
expressions  over  the  vocabulary  of  terminals  and  nonterminals.  It  would  be  desirable 
to  produce  the  "simplest"  regular  expressions  to  aid  human  imderstanding  and  to 
simplify  automated  parser  generation.  The  concept  of  one  regular  expression  being 
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"simpler"  than  another  is  difficult  to  define,  but  there  are  several  regular  expression 
identities  that  can  be  used  to  transform  a  regular  expression  into  another  that  is 
likely  to  be  simpler  in  some  fashion.  These  identities  are  shown  below. 

1)  a4{/^)=a+/^. 

2)  Q{pi)=ah. 

3)  If  L  {a)QL  i/J),  then  a+^=  /3. 

4)  e+aa*=a  and  e-\<x*a=a. 

5)  If  L  (a  )CL  {/f),  then       = /f  and  /Ta=^. 

6)  (a)*=a. 

7)  {aa)*=a  and  (aa)*=a*. 

8)  {a*+/3f={a+/3f. 

9)  {a*0')*={a+/3)\ 

10)  If  L  {a)QL  {(f),  then  = /T. 

These  regular  expression  identities  bear  resemblance  to  two  well-known  regular 
expression  problems:  containment  and  star  height.  The  containment  problem  for 
regular  expressions  a  and  /3  consists  of  determining  if  L  {c()QL  (/?).  It  has  been  shown 
that  the  containment  problem  is  CSL-complete  [19],  where  CSL-complete  means  that 
the  problem  reqmres  a  non-deterministic  linear  bounded  automaton  for  solution. 
Being  CSL-complete  means  that  the  problem  is  NP-complete  [13].  Therefore  there  is 
little  hope  for  an  efficient  algorithm  to  correctly  apply  identities  3,  5  and  10. 

The  other  classic  regular  expression  problem  is  the  star  height  problem.  The 
star  height  of  a  regular  expression  is  the  maximum  nestedness  of  stars  in  the 
expression  [20].  Identities  6,  7,  8  and  9  are  fairly  simple  patterns  to  transform  a  reg- 
iilar  expression  into  one  with  a  smaller  star  height.  Star  height  is  defined  formally 
(using  the  abbreviation  "sA"  to  mean  star  height)  as 
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1)  sh{a)  =  sh{e)=0; 

2)  sh{a+j3)  =  sh{al3}=max{sh{a),sh{/3});  and 

3)  sh{a)=sh{a)+l. 

The  star  height  of  a  regular  language  is  the  minimum  star  height  for  any  regu- 
lar expression  representing  that  language.  The  star  height  problem  is  that  of  discov- 
ering an  algorithm  to  determine  the  star  height  of  an  arbitrary  regular  language. 
There  has  been  considerable  research  into  the  star  height  problem  without  producing 
such  an  algorithm  [21,22].  At  one  time  it  was  not  known  if  there  existed  regular 
languages  with  star  heights  greater  than  1.  However,  it  is  now  known  that  there 
exist  regular  languages  of  any  arbitrary  star  height  [20]. 

Considering  these  two  classic  problems  and  their  relationship  to  the  10  regular 
expression  identities,  it  is  unlikely  that  any  efficient  algorithm  can  be  found  to 
correctly  simplify  expressions  using  these  identities.  Consequently,  we  have  imple- 
mented simplification  procedures  that  only  approximately  implement  the  identities. 
The  basic  difference  is  that  the  test  for  containment  is  greatly  simplified.  The  C 
language  code  for  the  containment  function.  Contains,  is  listed  in  the  appendix. 
Contains  is  simplified  without  causing  erroneous  applications  of  the  identities  by 
either  returning  containment  or  uncertainty.  Identities  3,  5  and  10  are  only  applied 
when  the  containment  function  identifies  a  definite  containment. 

Contains  is  a  function  with  four  parameters.  The  first  parameter,  el,  is  an 
expression  that  may  or  may  not  contain  the  second  parameter,  e2,  which  is  also  an 
expression.  The  third  and  fourth  parameters  are  integers  that  define  which  subex- 
pression of  e2  is  to  be  tested  for  containment  in  el. 

The  first  work  done  by  Contains  is  to  take  care  of  the  simple  cases.  If  e  1  and 
e2  are  equivalent  expressions,  then  Contains  returns  true.  If  this  test  fails  and  if 
e2=e,  then  Contains  returns  true  if  el  is  nullable  or  false  if  el  is  not  nullable.  If 
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these  simple  tests  do  not  answer  the  containment  question,  then  Contains  proceeds 
with  a  case- by-case  analysis  based  on  type  of  expression  represented  by  el. 

If  e  1  is  either  e  or  a  symbol  node,  then  Contains  returns  false.  This  is  reason- 
able since  e  can  only  contain  itself  and  an  equality  test  has  already  failed.  Similarly  a 
symbol  may  only  contain  itself. 

If  e  1  is  an  expression  of  the  form  a  ,  then  Contains  checks  for  containment 
based  on  the  type  of  expression  represented  by  e2.  If  e  2  is  a  symbol  node,  then 
Contains  calls  itself  to  see  if  c  2  is  in  a.  If  e2  is  an  OR  expression,  then  Contains 
checks  each  child  of  e 2  to  see  if  it  is  contained  in  cl.  If  e  2  is  of  the  form  /T,  then 
Contains  calls  itself  to  see  if  el  contains  /3.  Lastly,  if  e2  is  a  concatenation,  then 
Contains  checks  to  see  if  e2  is  contained  in  a.  If  not,  then  Contains  checks  to  see  if 
c2  can  be  split  into  subexpressions  such  that  each  subexpression  is  contained  in  el. 

If  e  1  is  an  OR  expression,  then  Contains  checks  for  containment  based  on  the 
type  of  e  2.  If  e  2  is  also  an  OR  expression,  then  every  child  of  e  2  must  be  contained 
in  cl  or  else  Contains  returns  false.  If  e2  is  any  other  type  of  expression,  then  Con- 
tains checks  to  see  if  e  2  is  contained  in  any  child  of  e  1. 

If  el  is  a  concatenation,  then  Contains  checks  for  containment  based  on  the 
type  of  e 2.  If  e 2  is  either  a  symbol  or  an  expression  of  the  form  a,  then  Contains 
looks  for  containment  of  e  2  by  a  child  of  e  1  with  all  other  children  of  e  1  being  null- 
able.  If  e  2  is  an  OR  expression,  then  Contains  checks  each  child  of  e  2  to  see  if  it  is 
contained  in  el.  Finally,  if  e2  is  a  concatenation,  the  Contains  uses  the  function 
TailContains  to  determine  if  the  children  of  e  2  can  be  matched  with  the  children  of 
el  in  pieces  with  the  unused  children  of  el  being  nullable. 

The  testing  done  by  Contains  is  fairly  extensive,  but  is  certainly  not 
comprehensive.  For  example,  the  expression  (ab)*  clearly  contains  a{ba)*.  However, 
Contains  will  not  detect  the  containment.  Such  simple  examples  would  not  be  likely 
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to  occiir  in  a  continuation  language.  Unfortunately,  more  complex  examples  do 
occur. 

Consider  grammar  G2  shown  in  figure  6-5.  The  LR(0)  machine  for  G2  has  a 
conflict  in  state  2  with  one  conflicting  action  being  a  reduction  by  E—*-T.  Continua- 
tion analysis  shows  that  Backlog  [E^T  .,2)  =  [[[±-|-  -T\-T]*l.  ]-f- 
-T[-T\*)m/F[/F]*  )]-H[-T  +/F\/F\*  -T\[-T\*t  [[-L  + -r[-T]*±  \+/F\/F]*[L 
+    -T[-n±    ]]-f[[)    +/F\/Fn    ^[-T-^/F[/F\    *-T\[-T\y  /F\/F 

-n-rnw)  -^/F\/Fn+\-T  ^/f\/f]  *-n-Ttt  /fwl  h-  -t[-t]*± 

_T[-T]*)][[)  +/F\/F]*  )]^-T  -H/F[/F]*-r][-T]*)]*[[±  +  -T[-T]*±  ]+/^(/F]*[_L 
+  -r[-T]*±  ]]]]]+)[[[)  -H/F[/F]*)]-f-[-T  *-T\[-T\*t[\L  ^  -T\-T\*L 

\+/F\/F\\L  -f  -T[-r]*±  -h/F[/Fr-r][-r]*)r 

-T[-T\*m  +/F[/F] -T][-r]*)]*  /F]*[[±  -T[-r]*± 

_T[-r]*)][[)  )]-h[-r  ^/F[/F\*-T\[-T]*t[[i.  -h  -r[-r]*± 

-f-  -T[-T|*±  ]]]]].  After  simplification  we  obtain  Backlog {E-*T., 2)  =  [[-T]*± 
-f[-r]*)[[[/F]*  [-T]Y{/F]*{-T]*±  +[[/F]*\-Tn*  /F{/F  +[-T]*)[{/F]*[-Tn 
*/F]*[[-T]*±  +[-T]*)  [[/F]*[-T]*)]*  [/F]*[-r]*±  ]]].  This  is  still  a  fairly  complex 
expression  that  can  be  further  simplified  to  {-T]*±  +  [-T]*)  [[/F]*[-T]*)]* 
[/F]*[— T]*_L.  This  last  simplification  step  would  require  an  improved  containment 
function  in  order  to  be  automated. 

E^E-T  E^T 
T-*T/F  T^F 
F-*{E)  F-^i 

Figure  6-5.  Grammar  G2. 


The  simplification  algorithms  implemented  during  this  research  provide  significant 
improvement  over  the  original  regular  expressions.  However,  there  is  significant  cost 
involved  in  these  algorithms.  Fortunately,  simplification  is  not  reqviired  in  order  to 
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Mse  continuation  languages  for  lookahead.  The  basic  need  for  simplification  is  for  the 
presentation  of  continuation  languages  to  people. 


CHAPTER  7 
IMPROVEMENTS  TO  THE  ALGORITHM 


Our  algorithm  to  compute  Backlog  consists  of  executing  Solve'  using  the 
homomorphism  sequence  calculated  by  Eliminate'.  In  this  section  we  present  two 
basic  techniques  for  reducing  the  complexity  of  this  computation.  We  shall  assume 
in  our  analysis  that  the  functions  Or,  Cat  and  Star  execute  in  constant  time. 

The  time  complexity  of  Solve'  is  0{l)  where  /  is  the  nmnber  of  quadruples  in 
the  homomorphism  sequence.  Since  Eliminate'  computes  the  homomorphism 
sequence,  the  complexity  of  Eliminate'  is  at  least  as  great  as  the  complexity  of 
Solve '. 

If  we  assume  that  the  initialization  code  in  Eliminate '  is  rendered  mmecessary 
by  storing  only  values  for  H{i,j)  that  are  not  equal  to  <j),  then  the  time  complexity  of 
Eliminate'  is  due  to  the  nested  loops  on  i,  j  and  k.  For  iteration  i  of  the  outermost 
loop,  let  Ni  be  the  number  of  vertices  j >i  such  that  P{j,i)j^(j>.  Likewise,  for  itera- 
tion j  of  the  middle  loop,  let  M,y  be  the  number  of  vertices  k>i  such  that 

n-l 

P{i,k)^(j).  Then  Eliminate'  v&(\vi\ves  0{  J^N^M^j)  time  [18]. 

1-0 

The  complexity  for  Eliminate'  for  a  particular  graph  is  dependent  on  the  com- 
plexity of  the  graph  and  the  ordering  of  the  vertices.  The  worst  case  occurs  for  a 
dense  graph,  which  requires  0{n^)  time.  Considerable  effort  has  been  spent  search- 
ing for  good  reordering  schemes  for  graph  problems  and  for  related  sparse  matrix 
problems  [18,23]. 

We  present  two  basic  techniques  for  improving  the  performance  of  computing 
Backlog .  The  first  technique  consists  of  reducing  the  nmnber  of  vertices  in  the  graph 
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to  be  analyzed,  while  the  second  consists  of  decomposing  the  graph  into  its  strongly 
connected  components  (SCCs),  building  homomorphism  sequences  for  the  SCCs  and 
combining  these  to  form  a  homomorphism  sequence  for  the  entire  graph.  The  second 
technique  may  be  applied  either  to  the  original  graph  or  to  the  result  of  the  first 
technique. 

Reducing  the  Number  of  Vertices  in  the  NLR(O)  Graph 

Our  goal  is  to  compute  continuations  for  r^R(O)  states  representing  either  shift 
or  reduce  actions  from  inconsistent  LR(0)  states.  We  xise  the  term  goal  state  to 
refer  to  any  such  NLR(O)  state.  A  goal  state  representing  a  shift  action  for  an  incon- 
sistent LR(0)  state  q  is  of  the  form  {A—*a.t/3,q),  while  a  goal  state  representing  a 
reduce  action  is  of  the  form  {A—*-uj.,q).  We  say  that  a  state  is  useful  if  it  is  an 
ancestor  of  a  goal  state. 

Since  the  computation  of  Backlog  for  a  goal  state  involves  only  edges  contained 
in  some  path  from  the  NLR(O)  start  state  to  that  goal  state,  we  may  remove  from 
the  graph  any  useless  states.  Determining  useful  states  can  be  done  by  computing 
the  SCCs  of  the  graph  (using  an  algorithm  by  Tarjan  [24])  and  forming  the  conden- 
sation. Since  the  condensation  is  acyclic  it  is  easy  to  determine  which  SCCs  are 
ancestors  of  an  SCC  containing  a  goal  state.  We  can  then  eliminate  all  NLR(O) 
states  that  are  contained  in  SCCs  that  are  not  ancestors  of  an  SCC  containing  a  goal 
state.  We  also  eliminate  the  edges  associated  with  these  states. 

Theorem  7-1:  Let  G={V,E)  be  the  graph  formed  from  an  NLR(O)  machine  M 
and  and  let  G'={V',E')  be  the  subgraph  of  G  induced  by  the  ancestors  of  the  goal 
states  of  M.  Then,  for  any  vertex  v  E  V,  Backlog (v)  computed  using  G'  is  the  same 
as  Backlog {v)  computed  using  G. 
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Proof:  We  have  defined  the  Backlog  for  an  NLR(O)  state  v  as  \J Backlog  (tt),  where 

n 

TT  is  any  path  from  the  NLR(O)  start  state  to  v.  Clearly  the  paths  to  v  are  not 
affected  by  removing  vertices  that  are  not  ancestors  of  v.  Therefore,  Backlog (v)  is 
the  same  in  either  G  or  G'.  Since  the  computation  of  Backlog  is  correct,  the  com- 
putation yields  equivalent  expressions  using  either  G  ov  G'. 

For  an  example  of  the  use  of  this  theorem  consider  grammar  G2  in  figure  7-1. 
Its  LR(0)  automaton  is  shown  in  figure  7-2  and  its  NLR(O)  automaton  is  shown  in 
figure  7-3. 

S^SBc  B^b 
5— ►a  5-^ 

Figure  7-1.  Grammar  02. 


Figure  7-2.  LR(0)  automaton  for  grammar  G2. 

There  is  only  one  inconsistent  state  in  the  LR(0)  automaton  for  G2,  state  1. 
The  conflict  in  state  1  is  a  shift  on  either  6  or  _L  conflicting  with  a  reduction  by 
5— »■€.  The  goal  states  in  its  NLR(O)  machine  are  3=(5'— »-5.±,l),  5=(5->.6,l)  and 
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I  0  S'->.5±,0  t- 
*!  1  S->.SBc.O  \- 
1    2j->..,0  I 


3  S'-'i.X.l  t- 
-H  4  S->S.Bc,l  I 

1  sfl-.t.i — I 

'I  I 


-H  7  S'-^sJITTa" 


'l  8  s-*sbTJ~ 


'l    10  B->r75~ 


'i  11  s->r76~ 


Figure  7-3.  NLR(O)  automaton  for  grammar  G2. 
6  =  (5 — ►.,1).  If  we  remove  those  NLR(O)  states  that  are  not  ancestors  for  any  of 
these  goal  states,  we  obtain  the  simpler  graph  shown  in  figure  7-4. 


I    0  S'-*.S±.0 


-»|    3  5'-»-5.X.l 


[  ^    1  S^.SBc.O     I  — »|    4  S^S.Bc.l 


*|     6  B-TE 


6  B-».,l 


Figure  7-4.  Simplified  NLR(O)  automaton  for  grammar  G2. 


Another  way  to  reduce  the  number  of  vertices  in  the  NLR(O)  graph  is  to  remove 
states  that  are  not  goal  states  and  that  contribute  only  e  to  Backlog'^. 

Definition  7-1:  An  NLR(O)  state  q  is  insignificant  if  q  is  not  a  goal  state  and  if, 
for  all  NLR(O)  states  p,  Backlog [q ,p)  =  e. 

As  previously  discussed,  an  X-transition  from  a  state  {A—*-a.X/3,q)  always  con- 
tributes £  to  Backlog,  while  an  e-transition  from  a  state  {A—*a.B0,q)  only  contri- 
butes e  to  Backlog  if  ^=e.  If  all  transitions  from  a  non-goal  state  contribute  e  to 


*  Pager  [6]  advocates  an  analogous  technique  with  his  "lane  compression." 
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Backlog,  this  state  can  be  removed  and  its  edges  can  be  carefully  replaced  without 
changing  the  result  of  Backlog  for  any  goal  state. 

For  a  non-goal  state  of  the  form  {A—*-a.t0,q),  the  only  possible  transition  is  an 
X-transition  on  the  terminal  symbol  t.  Therefore,  any  such  state  is  insignificant. 
The  other  possibility  for  contributing  only  e  is  a  state  of  the  form  {A—>^a.B,q), 
which  does  have  e-transitions,  but  has  e  following  B.  All  other  non-goal  states  are 
significant  and  are  of  the  form  {A—*'a.BP,q)  where  For  these  states  e- 

transitions  contribute  (3  to  Backlog ;  while  a  jB-transition  contributes  e. 

When  removing  insignificant  states,  we  simply  add  an  edge  from  each  contribut- 
ing state  q  to  each  significant  state  r  such  that  r  is  reachable  from  q  along  a  path 
that  consists  of  a  e-transition  (or  a  transition  on  5)  followed  by  a  sequence  of  transi- 
tions contributing  e  to  Backlog .  This  relationship  between  states  is  defined  next. 

Deflnition  7-2:  For  an  NLR(O)  machine  with  states  q  and  r,  we  say  that  q  fol- 
lows r  if  there  exists  a  path  {q,Vi,V2,  "  ■  •  where  Vj^  =r,  {q,Vi)  is  a  transition  on 
e  or  5,  and,  for  l<i  <k,  Backlog {vi,Vi^i)=e. 

This  definition  of  "follows"  captures  the  notion  that  Tail{q)  "immediately  fol- 
lows" Tail[r)  in  Continuation [r).  Of  course  there  might  be  several  states  q  that 

follow  r;  in  that  case,  \JTail{q)  "immediately  follows"  Tail{r). 

1 

After  removing  insignificant  states  and  adding  edges  from  each  significant  state 
to  all  others  that  follow  it,  we  are  left  with  a  simpler  graph  with  which  we  may  com- 
pute Backlog  correctly  for  significant  states.  Furthermore,  each  edge  in  the  resulting 
graph  (except  for  the  one  representing  a  transition  on  S)  represents  a  path  from 
some  state  {A—*-a.B^,q)  to  another  significant  state  (or  to  itself).  Backlog  for  that 
path  is  p. 

In  any  NLR(O)  machine  there  are  three  states  involving  the  production 
S'—*-Sl.:  SQ={S'—*-.SA-,qQ),  Si=(5'— ►5.±,gi)  and  S2={S'—*-SJ-.,q2).  Since  is 
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the  only  state  reachable  by  _L,  92  ^  consistent  LR(0)  state.  This  means  that  §2  is 
not  a  goal  state.  Furthermore,  §2  has  no  successors.  Therefore,  $2  is  an  insignificant 
state.  The  state  Sq  is  significant  since  it  has  e-transitions  which  contribute  ±  to 
Backlog.  There  is  a  transition  on  S  from  SqU)  Si  which  contributes  e  to  Backlog.  If 
Si  were  always  a  non-goal  state,  then  every  transition  from  Sq  would  contribute  _L 
after  removing  insignificant  states.  This  would  mean  that  every  transition  in  the 
graph  resulting  from  insignificant  state  removal  would  contribute  a  non-e  string  to 
the  backlog.  Unfortunately,  the  unambiguous  grammar  G2  in  figure  7-1  has  a 
shift/reduce  conflict  in  LR(0)  state  (state  1  in  the  figure).  Nevertheless,  this  does 
yield  a  trivial  computation  of  Backlog  for  the  edges  in  the  resulting  graph. 

For  an  example  of  this  state  removal  technique,  consider  figure  7-5a  which  dep- 
icts a  portion  of  the  NLR(O)  automaton  in  figure  4-3.  State  (5'— ►.5_L,0)  contributes 
_L  to  Backlog  and  {S—*^b.Db,2)  contributes  b.  Thus  they  are  both  significant  states. 
However  (5— ►.61)6,0)  is  insignificant  since  it  is  not  a  goal  state  and  it  has  a  terminal 
transition.  Upon  its  removal  we  add  an  edge  from  {S'—*.SJ-,0)  to  {S—*-b.Db,2) 
obtaining  the  graph  shown  in  figure  7-5b. 


0  S'-*.S±,0     I  I    0  S'-*.S±.0 


J. 


1  S-*.bDb,0  I 


6  S->b.Db,2  6  S->b.Db,2 


Figure  7-5.  Portion  of  the  NLR(O)  automaton  for  grammar  Gl. 

a)  Original;  b)  After  removing  an  insignificant  state. 


Theorem  7-2:  Let  G={V,E)  be  the  graph  formed  from  an  NLR(O)  machine  M 
and  and  let  G'={V',E')  be  defined  by  F'={v|  v  is  significant}  and  E'={{v,w)  where 
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v,weV'  and  v  follows  w}.  Then,  for  any  vertex  vSV,  Backlog {v)  computed  using 
G'  is  the  same  as  Backlog {v)  computed  using  G. 

Proof:  Assume  that  veV\  Then  by  definition  5-5  using  graph  G, 
Backlog {v)  =  U Backlog (tt)  where  tt  is  a  path  from  the  NLR(O)  start  state  Sq  to  v. 

Let  7r=(uojVi,  ■  ■  ■  Vj^)  he  one  such  path.  We  know  that  Vq  (the  start  state)  can  con- 
tribute _L  to  Backlog  and  must  be  in  V\  There  are  two  possibilities  for  Vj.  Either 
Vj  is  Sj  =  (S' — ►5._L,9i)  or     is  some  other  state. 

case  i)  Vi  =  Si. 

Here  we  know  from  the  previous  discvission  that  the  only  possible  successor  to 
is  insignificant.  Therefore,  tt  is  known  to  be  the  edge  {vq,Si).  Furthermore,  this  is 
the  only  path  from  Vq  to  s^.  Therefore  Backlog {v)  =  €,  regardless  of  which  graph  is 
used. 

case  ii)  Vit^s^. 

Let  ■  ■  ■       be  the  interior  vertices  of  tt  which  contribute  to  Backlog 

along  TT.  Clearly,  7f ={vQ,Vi^,Vi^,  ■  ■  ■  ,v^,Vjfc)  is  a  path  in  G'  such  that 
Backlog  [if) = Backlog  (tt).  Therefore  Backlog{v)  computed  using  G'  is  the  same  as 
Backlog (v)  computed  using  G. 

We  can  also  observe  that  the  previous  argument  would  work  equally  as  well 
with  G  being  the  graph  induced  by  the  ancestors  of  the  goal  states.  Therefore  we 
may  apply  either  or  both  of  these  two  theorems  to  reduce  the  number  of  vertices  in 
the  graph  used  to  compute  Backlog . 

Definition  7-3:  An  NLR(O)  automaton  is  called  reduced  if  useless  and  insig- 
nificant states  have  been  removed. 
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A  example  of  an  NLR(O)  automaton  reduced  by  applying  both  these  techniques 
is  shown  in  figure  7-6.  This  automaton  is  derived  from  the  one  depicted  in  figure  4- 
3.  All  the  original  states  are  shown  for  illustrative  purposes;  however,  there  are  no 
edges  drawn  to  insignificant  states. 


0  S'-*.S±,0 


1  S-».bDb,0 


2  S— •  bBac.O 


3  S— ►  aDa.O 


4  S->.aBbc,0 


6  S'--S.l.l~\ 


I  11  S'^S1..*\ 


1    8  S-»b.Db.2 


12  S— bD  .bT5~| 


19  S->bDb.  .11 


*!  7  S->b.B>c,a  I 


I  8  D->.a.2  I 
I  9  B->.A,2  I 
I     10  A->.t,a  I 


13  S-»bB  .ac77~] 


|20  S-»bBa.c7r2l         |26  S->bBac.  ,15| 


I  aifl-^..*n 


*!  22  D->«.  .8  I 
*!     23  A-»>.  ,8  I 


*[  14  S->a.Da,3 
*!  IS  S-»«.Bbc,8  I 


24  S-»aD.a,9 


I  27  S-^a  .  ,13 


16  B-».A,3 


126  S-Md.bc,10| 


|28  S-»aBb.c,14| 


129  S->aBbc..ie| 


Figure  7-6.  Reduced  NLR(O)  automaton  for  grammar  Gl. 
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Decomposition  of  the  NLRfO)  Graph  in  SCCs 

Tarjan  [18]  presents  several  techniques  for  dividing  a  graph  into  components  in 
order  to  reduce  the  complexity  of  computing  a  path  sequence.  The  strategy  consists 
of  splitting  the  graph  into  components,  computing  a  path  sequence  for  each  com- 
ponent and  combining  the  path  sequences  to  obtain  a  path  sequence  for  the  original 
graph.  He  first  presents  a  decomposition  based  on  SCCs.  This  technique  is  appropri- 
ate whenever  the  sizes  for  the  SCCs  are  small  with  respect  to  the  size  of  the  original 
graph.  For  a  graph  containing  one  or  more  relatively  large  SCCs,  he  develops  a 
decomposition  technique  based  on  the  dominator  tree  of  the  graph.  For  a  graph 
derived  from  the  NLR(O)  automaton  for  a  programming  language  grammar,  the 
decomposition  by  SCCs  provides  a  substantial  improvement. 

In  [18]  Tarjan  presents  a  theorem  which  provides  a  technique  for  combining  the 
path  sequences  for  SCCs  to  form  a  path  sequence  for  a  graph.  This  theorem  can  be 
modified  to  compute  homomorphism  sequences  rather  than  path  sequences. 

Theorem  7-3:  Let  G  =  {V,E)  be  the  graph  derived  from  an  NLR(O)  machine  with 
start  state  Qq  and  let  GijG^,  ■  ■  •  be  the  SCCs  of  G  ordered  such  that  {v,w)eE 
implies  vGF,-  and  wEVj  where  i<j.  For  l<i<k,  let  X,-  be  a  homomorphism 
sequence      for      G,-      and      let  be      a     sequence      consisting  of 

{{Backlog {v,w),{v,w),v,w)\{v,w)eG,veVi  and  w^V,}  ordered  arbitrarily.  Then 
Xi,Yi,X2,Y2,  •  •  •  Xig_i,Yif_i,Xii  is  a  homomorphism  sequence  for  G. 

Proof:  The  proof  is  by  induction  on  the  number  of  SCCs. 

Base  case:  Assume  that  there  is  only  one  SCC. 

Clearly,  the  defined  sequence  is  Xj,  which  is  a  homomorphism  sequence  for  G. 

Inductive  case:  Assume  that  G  has  k  SCCs  with  k  >1  and  that  the  theorem  is 
true  for  any  graph  with  ^—1  SCCs. 
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Let  G'  be  the  subgraph  of  G  induced  by  GiUG'2U  •  •  •  UG^t-i-  Clearly  G'  has 
k—1  SCCs  and,  by  assumption,  Xi,Yi,X2,Y2,  •  ■  •  Xk-2j^k-2f^k-i  a  homomor- 
phism  sequence  for  G\  Let  tt  be  any  path  in  G  starting  from  Qq.  Then  tt  can  be 
split  into  two  paths  tTj  and  where  each  tTj  is  a  path  in  G'  and  7r2=(uojVi>  '  "  "  %) 
where  Vq^^Eii  and,  for  l<t<m,  v^GEj^.  Clearly,  to  construct  this  path  from  the 
path  sequence  for  G',  we  need  to  add  an  edge  from  Vq  to  to  the  appropriate  Yj 
and  add  the  pieces  needed  from  Xj^.  Similarly,  we  need  to  add  Backlog  {vq,v^  and 
the  appropriate  homomorphic  images  of  paths  from  to  the  homomorphism 
sequence.  Therefore,  if  we  add  to  each  Y^  for  1<«  <^— 1  the  edges  from  vertices  of 
Ei  to  Ek  and  their  Backlogs,  we  see  that  Xi,Yi,X2,Y2,  ■  ■  ■  Xif_i,Yif_i,Xif  is  a 
homomorphism  sequence  for  G. 

This  decomposition  by  SCCs  can  be  applied  following  removal  of  useless  and 
insignificant  states.  In  table  7-1  we  show  the  results  of  applying  both  types  of 
improvements  to  three  grammars.  In  the  double  rows  labeled  "NLR(O)  States," 
"Largest  SCC"  and  "Expressions"  the  first  row  of  each  pair  contains  values  obtained 
from  the  original  NLR(O)  graphs,  while  the  second  row  contains  values  for  the 
NLR(O)  graphs  after  applying  all  the  improvements  discussed  in  this  section.  The 
double  row  labeled  "Expressions"  lists  the  number  of  expressions  in  a  homomorphism 
sequence  for  each  of  the  grammars. 

There  are  several  interesting  features  in  table  7-1.  First,  while  the  Pascal  gram- 
mar is  slightly  larger  than  the  Modula2  grammar,  the  Modula2  grammar  requires 
almost  twice  as  many  expressions  after  applying  the  improvements.  Also,  the  Ada 
grammar  is  about  twice  as  large  as  the  other  grammars,  but  it  requires  5  to  10  times 
as  many  expressions.  We  observe  that  there  is  a  definite  relationship  between  the 
number  of  NLR(O)  states  and  the  number  of  expressions.  There  are  about  0.015'W 
expressions  when  there  are  n  NLR(O)  states. 
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Table  7-1.  Analysis  of  Modula2,  Pascal  and  Ada 


Modula2 

Pascal 

Ada 

222 

251 

420 

TeriTiinals 

74 

67 

99 

iNonierniinais 

Q4 

loo 

LR(0)  States 

403 

404 

747 

Inconsistent 

133 

58 

163 

NLR(O)  States 

1939 

2599 

6157 

compressed 

1063 

870 

2713 

Largest  SCC 

293 

416 

1023 

compressed 

35 

110 

427 

Expressions 

38067 

107050 

376013 

compressed 

18582 

10309 

106813 

An  implementation  of  the  Backlog  algorithm  might  use  about  32  bytes  per 
expression.  This  would  mean  that  space  requirement  for  Ada  has  been  reduced  from 
about  12  million  bytes  to  about  4  million.  While  both  figures  are  fairly  large,  the 
difference  is  substantial  enough  to  make  the  algorithm  feasible  on  computers  with 
moderate  address  spaces. 

We  have  defined  useful  NLR(O)  states  and  we  have  shown  a  technique  to 
remove  useless  states  from  the  NLR(O)  graph.  In  addition,  we  have  defined  insignifi- 
cant NLR(O)  states  and  we  have  shown  how  to  remove  these  from  the  NLR(O)  graph. 
We  have  shown  a  third  improvement  to  the  Backlog  algorithm  based  on  splitting  the 
NLR(O)  graph  into  strongly  connected  components.  All  of  these  improvements  can 
be  applied  to  reduce  substantially  the  number  of  expressions  in  the  homomorphism 
sequence. 


CHAPTER  8 

USING  CONTINUATION  LANGUAGES  FOR  LOOKAHEAD 

Continuation  languages  can  be  used  to  produce  fc-symbol  lookahead  and  to  pro- 
duce regular  lookahead.  In  addition,  it  is  possible  to  use  continuation  languages  to 
prove  that  arbitrary  lookahead  is  required  for  some  grammars.  Furthermore,  in 
some  instances,  continuation  languages  can  be  used  to  show  that  no  form  of  look- 
ahead  can  be  added  to  an  LR(0)  machine  to  resolve  conflicts. 

Determining  Whether  Adding  Lookahead  Will  Help 

There  exist  grammars  (typically  ambiguous)  for  which  there  is  no  form  of  look- 
ahead  which  can  be  added  to  the  LR(0)  machine  to  resolve  conflicts.  Consider  gram- 
mar G3  whose  LR(0)  machine  is  in  figure  8-2.  Its  NLR(O)  machine,  after  removing 
useless  and  insignificant  states,  is  in  figure  8-3. 

S-^aAd     S-^-bAe     A-^-Ac  B^Bc 
S^aBe      S^bBd     A—>-c  B^c 

Figure  8-1.  Grammar  G3,  which  defies  adding  lookahead. 

In  the  LR(0)  machine  for  G3,  state  5  is  inconsistent.  Corresponding  to  the 
reduce  actions  for  LR(0)  state  5,  are  NLR(O)  states  5  and  6.  The  continuation 
languages  for  NLR(O)  states  5  and  6  are  both  c*{</-|-e)_L.  Therefore  any  valid  con- 
tinuation for  state  5  is  also  a  continuation  for  state  6  and  vice  versa.  Hence,  there  is 
no  type  of  lookahead  that  would  resolve  this  conflict.  However,  G3  is  tmambiguous. 
The  conflict  could  be  resolved  either  by  starting  with  an  LR(l)  parser  or  by  looking 
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Figure  8-2.  LR(0)  Machine  for  G3. 
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I  0  s-—.s±,o  I 


"l   1  S-«.Be.2  I 
i 

[     I    2  B->.Bc,2  I 

*!    8  S-O.M.2  I 

[    1    4  A-kAc.2  I 


1  S  A-»c.  ,5  I 
'l      6  B-»c.  .5  I 


'l  7  S->t).B<l,l  I 

[    'l  8  B-jlflcl  I 

"I  9  S-»b.A«.l  I 

[    *!  10  A-i.Ac.l  I 


Figure  8-3.  Reduced  NLR(O)  automaton  for  grammar  G3. 


back  on  the  LR(0)  parse  stack  to  determine  whether  state  1  or  state  2  was  entered 
from  state  0. 

In  general,  for  an  LR(0)  machine  with  a  conflict  in  state  q ,  there  will  be  at  least 
two  NLR(O)  states  r  ={A—*-oc.^,q)  and  s  ={B—*-6.''),q)  that  represent  conflicting 
actions.  We  assume  that  there  are  exactly  two  in  this  discussion,  since  more  than 
two  states  can  be  analyzed  in  a  pairwise  fashion.  K  Continuation  (r)  PI 
Continuation (s)  ^  <f>,  then  there  is  no  hope  that  adding  lookahead  to  the  LR(0) 
machine  will  resolve  the  conflict. 
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Determining  Whether  Arbitrary  Lookahead  is  Required 


We  present  a  technique  that  can  tell,  for  some  grammars,  if  the  grammar's 
LR(0)  conflicts  cannot  be  resolved  by  adding  A;-symbol  lookahead  for  any  value  of  k. 
It  is  known  that  for  an  arbitrary  grammar,  it  is  undecidable  whether  the  grammar  is 
LALR(A;)  for  some  value  of  k  [25].  However,  it  is  decidable  for  some  grammars. 

Consider  the  arithmetic  expression/set  expression  grammar  from  figure  1-3, 
shown  again  in  figure  8-4.  The  LR(0)  automaton  for  this  grammar  has  10  incon- 
sistent states.  Most  of  these  states  have  conflicts  resolvable  using  single-symbol  look- 
ahead.  However,  two  LR(0)  states  require  arbitrary  lookahead  to  resolve  their  con- 
flicts. This  can  be  seen  from  examination  of  the  portion  of  the  reduced  NLR(O) 
machine  for  the  grammar  shown  in  figure  8-5. 

S  -►AE  =  AE  S  -►SEsSE 

AE-^AE-hAT  SE-^SE-I-ST 

AE     AE  -  AT  SE      SE  -  ST 

AE-^AT  SE-*ST 

AT     AT  *  AF  ST      ST  *  SF 

AT-*AF  ST-^SF 

AF     id  SF  id 

AF  — ►  const  SF  —*■  const 

Figiire  8-4.  Arithmetic  expression/set  expression  grammar. 

LR(0)  state  3  for  the  AE/SE  grammar  has  a  reduce/reduce  conflict  with  reduc- 
tions by  SF—*^id  and  AF—*id.  Since  the  productions  for  arithmetic  expressions  are 
isomorphic  to  those  for  set  expressions,  figure  8-3  shows  just  the  portion  of  the 
NLR(O)  automaton  that  contributes  to  the  Backlog  for  state  {SF—*'id.,3).  We 
observe  that  Continuation  (SF-*- id., 3)  =  [*SF)\{-\-\-)ST)*  =SE1.  ^  .  Similarly, 


^  Since  "+"  G  T ,  we  use  I  rather  than  the  regular  expression  operator  "+' 
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S'-¥.S±.0 

f  >1  1 

SF-^id..3 

> 

> 

S—.SE  =  SE.O 

f 

SE-t'.SE  +  ST.0 

f 

\ 

f 

>- 

SE-f.SE  -ST.O 

\ 

f 

( 

ST-i-.ST  *  SF.O 

Figure  8-5.  Part  of  the  reduced  NLR(O)  automaton  for  AE/SE  grammar. 

Continuation {AF^ id. ,3)  =  {*AF)*{{+\-)AT)*=AE±.  Since  SF  and  AF  both 
generate  {id, const},  we  observe  that  both  continuation  languages  begin  with 
equivalent  starred  expressions.  In  fact,  the  second  starred  expressions  of  the  two  con- 
tinuations are  also  equivalent.  The  only  real  difference  is  that  the  first  has  "=", 
whereas  the  second  has  "  =  ".  Clearly  "='  and  "  =  "  can  be  preceded  by  equivalent 
strings  of  arbitrary  length.  Successful  lookahead  requires  looking  arbitrarily  far 
ahead  to  find  either  "="  or  "  =  ". 

In  general,  for  an  NLR(O)  machine  with  states  r  and  s  representing  two  con- 
flicting actions  for  an  inconsistent  LR(0)  state,  if  Continuation  [r]  =  a/3^i+6  and 
Continuation (s)  =  a^2'^  where  /?  is  either  a  starred  expression  or  an  expression 
containing  a  nonterminal  generating  an  infinite  language,  then  finite  lookahead  can 
not  resolve  the  conflict  represented  by  states  r  and  s . 

Using  Continuation  Languages  for  A: -symbol  Lookahead 


Continuation  languages  may  be  used  to  calculate  ^-symbol  lookahead  strings  by 
computing  Firstly  for  each  NLR(O)  state  representing  a  conflicting  action  for  an 
inconsistent  LR(0)  state,  where  Firstij(Q!)  =  {x|  3j/   with  xyeL{a)  and  either 
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length{x)=k  or  length{y)=0}.  For  instance,  the  grammar  G4  shown  in  figure  8-6 
requires  3-symbol  lookahead  to  resolve  the  conflict  in  LR(0)  state  1. 

S-^ABC  B-^bd 
S— ►aBc  B— ►be 
A      a  C  d 

Figure  8-6.  Grammar  04. 


V_yA-^a   ^  ^^^S^aBc 

Figure  8-7.  LR(0)  automaton  for  grammar  04. 

In  the  LR(0)  automaton  in  figure  8-7,  there  is  a  conflict  in  state  1.  From  state  1 
a  reduction  is  possible  by  A— ►a  and  a  shift  on  b  to  state  5.  There  are  3  NLR(O) 
states  representing  LR(0)  actions.  State  (A— ►a.,!)  represents  a  reduction  and  states 
(5— ►.6</,l)  and  {B—*.be,l)  represent  a  shift  on  6.  Continuation  analysis  shows  that 
Continuation  {A—*' a., 1)     =     BCJ.,     Continuation  {B—*-.bd,l)     =     bdcl.  and 
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Continuation  [B—*'. be, l)  =  bee  A..  First^{BCl.)  =  {bdd,bed}.  If  we  combine  the 
First^  sets  for  the  shift  action,  we  obtain  {bdc,bec\  We  see  that  these  sets  are  dis- 
joint and  that  they  can  be  successfully  used  for  lookahead. 

The  lookahead  strings  computed  by  determining  First^  of  continuation 
languages  are  exactly  the  same  lookahead  strings  as  used  by  a  LALR(A;)  parser.  A 
grammar  is  defined  to  be  LALR(A;)  if  its  LR(0)  conflicts  can  be  resolved  by  adding 
the  necessary  ^-symbol  lookahead  strings  to  the  LR(0)  items  [26].  The  necessary 
lookahead  strings  are  those  which  can  legitimately  follow  the  portion  of  the  sentence 
parsed  to  reach  an  LR(0)  state.  This  is  exactly  what  Firstj^  of  continuation  languages 
yields.  Therefore  a  parser  built  by  adding  Firstj^  of  continuation  languages  is  an 
LALR(^)  parser. 

Using  Continuation  Languages  for  Regular  Lookahead 

We  present  several  methods  for  using  continuation  languages  for  regular  look- 
ahead.  First  we  present  a  method  for  transforming  the  ECFG  for  a  continuation 
language  into  an  equivalent  regular  expression.  Since  a  continuation  language  can  be 
any  arbitrary  CFL,  this  will  only  work  for  right-linear  or  left-linear  grammars.  For 
non-linear  grammars,  we  present  a  method  for  constructing  regular  supersets  of  con- 
tinuation languages. 

In  a  continuation  language  a  for  an  NLR(O)  state  q ,  there  may  be  several  non- 
recursive  nonterminals.  The  first  step  in  using  continuation  languages  for  regular 
lookahead  is  to  replace  any  such  nonterminal  with  the  union  of  the  right  hand  sides 
of  all  productions  having  that  nonterminal  on  the  left  hand  side.  This  process  will 
need  to  be  repeated  until  all  non-recursive  nonterminals  have  been  replaced.  It  is 
possible  that,  after  this  is  done,  a  will  be  transformed  into  a  regular  expression  /? 
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containing  only  terminal  symbols.  If  so,  then  use  /?  for  lookahead  for  the  action  asso- 
ciated with  state  q .  For  an  example  of  this  type  of  lookahead  consider  grammar  G5 
shown  in  figure  8-8  and  its  LR(0)  automaton  shown  in  figure  8-9. 

S-t-Aa  S^Bb  C^ba 
A-^AC  B^BD  D—'ba 
A^C  B-^D 

Figure  8-8.  Grammar  05. 

There  is  a  reduce/reduce  conflict  in  state  15  for  the  LR(0)  automaton  for  05. 
We  observe  that  Continuation [C — ►ftojlS)  =  C*a_L  and  Continuation [D — ►6a,15) 
=  D*6_L.  Clearly  Z/(C)=L(D)=  6a.  Therefore  Continuation{C—*ba,\b)={ba^al. 
and  Continuation[D  —*'ba ,\b)={ba^ b These  regular  expressions  are  sufficient  for 
resolving  the  conflict. 

In  a  continuation  language  a  for  an  NLR(O)  state  q ,  there  might  be  several  non- 
terminals easily  identifiable  as  generating  regular  languages.  After  replacing  all  non- 
recursive  nonterminals,  the  next  step  is  to  replace  all  nonterminals  A  where  is 
either  right  or  left-linear  by  L  {A ).  Since  linear  grammars  generate  regular  languages 
[14],  L{A)  can  be  expressed  as  a  regular  expression.  If  a  has  been  transformed  into  a 
regular  expression    with  no  nonterminals,  then  /?  may  be  used  for  lookahead. 

The  arithmetic  expression/set  expression  grammar  in  figure  8-4  illustrates  this 
process.  The  significant  conflict  in  that  grammar  is  a  reduce/reduce  conflict  in 
LR(0)  state  3.  Recall  that 

Continuation{SF^id.  ,2)  =  {*SF)\{-ir\-)ST)*^El. 

and 

Continuation{AF^id.,Z)  =  {*AF)*{{■\-\-)AT)*^AE^.. 
C\&^v\y,  L{SF)  =  L{AF)=id\const.  We  observe  that  ST  and  AT  are  left-linear  and 
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0    )  ^  /  1   =L_^  2 


D 


C 


B^D 


V  i. 


D 


A^AC 


S^Aa 


B^BD 


S^Bb 


C^ba 


^« — yn^ba 


C^ba 
D-fba 

Figure  8-9.  LR(0)  automaton  for  G5. 

their  grammars  generate  (id  \const){*{id  \const))*.  We  can  substitute  these  expres- 
sions into  the  continuations  to  obtain 

Continuation {SF^ id.  ,3)  =  {*{id  \const))*{{+\-){id  \const){*{id  \const))y=SE± 

and 
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Continuation{AF—*-id.  ,3)  =  {*{id  \const))*{{+\—){id  \const){* {id  \const))*)*=^AEA-. 

These  expressions  are  alike  prior  to  "="  in  the  first  and  "  =  "  in  the  second.  Thus, 
lookahead  must  continue  until  one  of  these  two  symbols  is  reached.  It  doesn't  matter 
what  would  replace  SE  and  AE. 

Having  performed  the  two  previous  transformations  on  a  continuation  language 
a  for  an  NLR(O)  state  q,  there  is  one  more  simple  transformation  that  can  be 
attempted.  Replace  each  nonterminal  A  in  a  by  .  Recall  that  includes  each 
terminal  in  T  that  appears  in  a  string  derived  from  A .  The  two  previous  transfor- 
mations replaced  a  with  equivalent  expressions.  The  replacement  of  A  with  T4* 
replaces  a  with  a  regular  superset  of  itself.  For  an  example  consider  grammar  G6 
shown  in  figure  8-10  and  its  LR(0)  automaton  shown  in  figure  8-11. 

S-^-ACa     S^BCb  C^aCb 
A^a         B-->-a  C-^ 

Figure  8-10.  Grammar  06. 

There  is  a  reduce/reduce  conflict  in  state  12  in  figure  8-11  that  requires  arbi- 
trary lookahead.  We  observe  that  Continuation{A—*-a,\2)=Cal.  and 
Continuation{B—*a,\2)  =  ChA-.  TQ  =  {a,b\  If  we  replace  C  in  these  expressions  by 
Tq*,  we  obtain  the  sufficient  lookahead  expressions  (a-|-6)*a_L  and  (a-|-6)*6_L. 

Strategy  for  Continuation  Analysis 

We  have  presented  several  useful  tests  and  lookahead  techniques  that  can  be 
combined  to  form  a  strategy  for  analyzing  continuation  languages  for  lookahead  util- 
ity. Despite  the  undecidability  of  LALR  membership  [25]  and  LR-Regular  member- 
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Figure  8-11.  LR(0)  machine  for  G6. 

ship  [2],  this  strategy  provides  a  viseful  technique  for  adding  lookahead  to  an  LR(0) 
parser. 

The  first  step  is  to  test  whether  the  grammar  has  an  intractable  conflict.  This 
test,  described  in  the  first  subsection  of  this  chapter,  is  not  guaranteed  to  detect  all 
intractable  conflicts.  However,  for  some  grammars,  it  can  show  conclusively  that 
there  is  no  form  of  lookahead  that  can  resolve  the  conflict.  In  such  a  case,  the  gram- 
mar must  be  modified  to  resolve  the  conflict. 

The  second  step,  described  in  the  second  subsection  of  this  chapter,  is  to  deter- 
mine whether  arbitrary  lookahead  is  required.  Failure  of  this  test  does  not  guarantee 
that  finite  lookahead  will  resolve  the  conflict.  To  do  so  would  violate  the  imdecida- 
bility  of  LALR.  Instead,  it  can  be  useful  for  some  grammars  to  show  that  finite 
lookahead  will  not  resolve  the  conflict. 
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If  there  is  no  intractable  conflict  detected  for  a  grammar  G  and  the  test  to  rule 
out  finite  lookahead  is  inconclusive,  the  next  step  is  to  try  finite  lookahead.  This 
should  be  done  by  starting  with  k  =  l  and  testing  progressively  larger  values  for  k. 
Since  it  is  undecidable  whether  there  exists  a  k  such  that  G  is  LALR(A;)  [25],  we  can- 
not attempt  values  of  k  until  we  succeed.  Instead,  we  must  select  some  arbitrary 
maximum  value  for  k.  When  this  maximum  value  for  k  has  been  attempted  and 
fails  to  resolve  the  conflict,  we  must  abandon  the  search  and  try  regular  lookahead. 

The  final  step  is  to  try  the  various  methods  presented  for  adding  regular  look- 
ahead.  These  methods  begin  by  replacing  all  non-recursive  nonterminals  with  the 
finite  languages  they  generate.  This  could  be  sufficient  to  resolve  the  conflict.  Next, 
all  right-linear  (or  left-linear)  nonterminals  are  replaced  by  the  regular  languages 
they  generate.  If  this  fails  to  resolve  the  conflict,  the  final  method  for  adding  regular 
lookahead  is  to  replace  each  nonterminal  A  by  T4  .  This  will  yield  regular  expres- 
sions of  terminal  symbols.  If  this  fails  to  resolve  the  conflict,  the  grammar  needs 
modification. 

In  this  chapter  we  have  shown  that  continuation  languages  can  be  used  for  k- 
symbol  lookahead  and  that  this  technique  is  equivalent  to  LALR(^).  We  have  also 
shown  a  technique  that  can,  for  some  grammars,  indicate  whether  the  grammar  is 
L7\LR(Ar)  for  any  k.  Furthermore,  we  have  shown  a  technique  which  can,  for  some 
grammars,  indicate  that  not  even  arbitrary  lookahead  can  resolve  conflicts.  Finally 
we  have  shown  that  continuation  languages  can  be  profitably  used  to  add  regular 
lookahead  to  an  LR(0)  parser. 


CHAPTER  9 
CONCLUSIONS  AND  FUTURE  RESEARCH 


We  have  developed  an  algorithm  to  compute  continuation  languages  and  we 
have  developed  refinements  of  that  algorithm  to  improve  its  efficiency.  This  refined 
algorithm  can  be  applied  to  modern  programming  language  grammars  to  calculate 
continuation  languages  for  NLR(O)  goal  states. 

We  have  shown  that  continuation  languages  can  be  expressed  as  regular  expres- 
sions of  terminal  and  nonterminal  symbols.  Generally  these  regular  expressions  tend 
to  be  fairly  complex  and  we  have  suggested  practical  methods  based  on  several  regu- 
lar expression  identities  for  simplifying  them.  We  have  shown  that  regular  expression 
simplification  is  akin  to  the  containment  and  star  height  problems  for  regular 
languages.  Based  on  the  intractability  of  these  problems,  we  conclude  that  there  is 
likely  to  be  no  method  that  can  efficiently  perform  simplifications  based  on  the  regu- 
lar expression  identities  we  have  presented.  Consequently,  we  suggest  that  regular 
expression  simplification  be  used  primarily  for  presentation  to  people  and  not  for 
computing  lookahead  based  on  continuation  languages. 

Previo\is  research  efforts  in  LR  lookahead  computation  have  avoided  \jsing  con- 
tinuation languages.  While  there  have  been  numerous  useful  techniques  developed, 
they  have  all  been  based  on  some  form  of  simplification  of  the  underlying  continua- 
tion languages.  We  propose  that  lookahead  computation  can  be  best  understood  by  a 
careful  analysis  of  these  continuation  languages.  The  research  presented  here  shows 
that  a  significant  portion  of  this  analysis  can  be  performed  automatically. 
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We  have  developed  techniques  for  single  symbol,  multiple  symbol,  and  regular 
lookahead.  We  have  shown  that  mdtiple  symbol  lookahead  computation  based  on 
continuation  languages  is  as  powerful  as  LALR(A;). 

There  are  several  avenues  of  research  opened  by  this  dissertation.  Research 
could  be  done  to  improve  the  efficiency  of  the  algorithm  even  more  than  was  done 
here,  to  develop  better  expression  simplification  heuristics  and  to  apply  the  same 
techniques  to  LL  parsing. 

Improving  the  Algorithm 

The  continuation  algorithm  presented  here  is  fairly  efficient,  but  it  tends  to  pro- 
duce fairly  complex  regular  expressions.  Many  programming  language  grammars 
tend  to  make  use  of  some  common  patterns  to  represent  structures  such  as  expres- 
sions or  parameter  lists.  An  analysis  of  some  of  these  common  patterns  might  reveal 
their  effects  on  continuation  languages.  These  effects  might  then  be  incorporated 
into  the  algorithm  to  produce  simpler  expressions  from  the  outset. 

The  effort  to  simplify  regular  expressions  was  only  partially  successful  due  to 
the  basic  intractability  of  the  problem.  However,  this  does  not  preclude  the  possibil- 
ity of  doing  an  adequate  job  for  common  programming  language  continuation  pat- 
terns. Some  effort  could  be  spent  studying  these  patterns  and  designing  new  heuris- 
tics for  detecting  and  simplifying  them. 
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Using  Continuation  Languages  for  LL  Parsing 

We  have  seen  that  continuation  languages  can  be  used  for  lookahead  in  LR 
parsing.  A  similar  study  could  be  made  of  continuation  languages  for  LL  parsing. 
The  underlying  graph  from  which  continuation  languages  would  be  derived  would 
consist  of  items  of  the  form  A—*a.^.  The  same  types  of  transitions  would  apply  and 
the  algorithm  used  would  be  essentially  the  same. 

Computing  continuation  languages  for  LL  parsing  would  be  significantly  easier 
than  for  LR  parsing,  since  the  underlying  graph  would  contain  far  fewer  vertices. 
Techniques  for  analyzing  these  continuation  languages  could  indicate  that  certain 
grammars  have  unresolvable  problems  or  that  finite  lookahead  would  not  work. 
Similar  techniques  to  those  for  LR  parsing  could  be  used  to  derive  finite  and  regular 
lookahead  from  the  continuation  languages.  This  would  combine  LR  and  LL  look- 
ahead  theory. 

The  primary  importance  of  this  research  is  the  development  of  single  symbol, 
multiple  symbol,  and  arbitrary  lookahead  techniques,  all  using  the  same  theoretical 
basis.  This  research  provides  useful  lookahead  computation  techniques  while  provid- 
ing a  unified  theory  of  lookahead  computation. 


REFERENCES 


1.  F.L.  DeRemer,  T.J.  Pennello,  and  W.M.  McKeeman,  "Ada  Syntax  Chart," 
SIGPLAN  Notices,  vol.  16,  no.  9,  pp.  48-59,  Sept.  1981. 

2.  K.  Culik  and  R.  Cohen,  "LR-Regular  Grammars  -  an  Extension  of  LR(k) 
Grammars,"  /.  Comput.  Syst.  Sci,  vol.  7,  no.  1,  pp.  66-96,  1973. 

3.  D.E.  Knuth,  "On  the  Translation  of  Languages  from  Left  to  Right,"  Informa- 
tion and  Control,  vol.  8,  no.  6,  pp.  607-639,  1965. 

4.  A.J.  Korenjak,  "A  Practical  Method  for  Constructing  LR(k)  Processors," 
Comm.  ACM,  vol.  12,  no.  11,  pp.  613-623,  1969. 

5.  D.  Pager,  "A  Practical  General  Method  for  Constructing  LR(k)  Parsers,"  Acta 
Informatica,  vol.  7,  no.  3,  pp.  249-268,  1977. 

6.  D.  Pager,  "The  Lane-Tracing  Algorithm  for  Constructing  LR(k)  Parsers  and 
Ways  of  Enhancing  its  Efficiency,"  Inf.  Sci.,  vol.  12,  no.  1,  pp.  19-42,  1977. 

7.  David  Spector,  "Efficient  Full  LR(l)  Parser  Generation,"  SIGPLAN  Notices, 
vol.  23,  no.  12,  pp.  143-150-175,  Dec.  1988. 

8.  F.L.  DeRemer,  "Simple  LR(k)  Grammars,"  Comm.  ACM,  vol.  14,  no.  7,  pp. 
453-460,  July  1971. 


9.  F.  DeRemer  and  T.J.  Pennello,  "Efficient  Computation  of  LALR(l)  Look-ahead 
Sets,"  ACM  TOPLAS,  vol.  4,  no.  4,  pp.  615-649,  Oct.  1982. 

10.  T.P.  Baker,  "Extending  Lookahead  for  LR  Parsers,"  Journal  of  Computer  and 
System  Sciences,  vol.  22,  no.  2,  pp.  243-259,  Apr.  1981. 

11.  M.E.  Bermudez  and  K.  M.  Schimpf,  "A  Practical  Arbitrary  Look-ahead  LR 
Parsing  Technique,"  Proceedings  of  the  SIGPLAN  '86  Symposium  on  Compiler 
Construction,  pp.  136-144,  ACM  Press,  Boston,  MA,  Jun.  1986. 

12.  M.E.  Bermudez  and  K.  M.  Schimpf,  "Practical  Arbitrary  Lookahead  LR  Pars- 
ing," Journal  of  Computer  and  System  Sciences,  to  appear. 

13.  A.V.  Aho,  J.E.  Hopcroft,  and  J.D.  Ullman,  The  Design  and  Analysis  of  Com- 
puter Algorithms,  Addison-Wesley,  Reading,  MA,  1974. 


74 


75 


14.  J.E.  Hopcroft  and  J.D.  UUman,  Introduction  to  Automata  Theory,  Languages, 
and  Computation,  Addison-Wesley,  Reading,  MA,  1979. 

15.  A.V.  Aho  and  J.D.  Ullman,  The  Theory  of  Parsing,  Translation  and  Compil- 
ing, vols  I  and  II,  Prentice-Hall,  Inc.,  Englewood  Cliffs,  NJ,  1972. 

16.  O.L.  Madsen  and  B.B.  Kristensen,  "LR-Parsing  of  Extended  Context  Free 
Grammars,"  Acta  Informatica,  vol.  7,  no.  1,  pp.  61-73,  1976. 

17.  S.C.  Kleene,  "Representation  of  Events  in  Nerve  Nets  and  Finite  Automata,"  in 
Automata  Studies,  ed.  J.  McCarthy,  pp.  3-40,  Princeton  University  Press, 
Princeton,  NJ,  1956. 

18.  R.E.  Tarjan,  "Fast  Algorithms  for  Solving  Path  Problems,"  Journal  of  the 
ACM,  vol.  28,  no.  3,  pp.  594-614,  July  1981. 

19.  H.B.  Hunt  EI,  D.J.  Rosenkrantz,  and  T.G.  Szymanski,  "On  the  Equivalence, 
Containment,  and  Covering  Problems  for  the  Regular  and  Context-Free 
Languages,"  Journal  of  Computer  and  System  Sciences,  vol.  12,  no.  2,  pp.  222- 
268,  April  1976. 

20.  Arto  Salomaa,  Jewels  of  Formal  Language  Theory,  Computer  Science  Press, 
Inc.,  Rockville,  MD,  1981. 

21.  Arto  Salomaa,  Formal  Languages,  Academic  Press,  New  York,  1973. 

22.  Farit  Validov,  "On  the  Standard  Star  Height  of  Regular  Sets.  I,"  Journal  of 
Information  Processing  and  Cybernetics,  vol.  22,  no.  9,  pp.  463-473,  1986. 

23.  J.R.  Bunch  and  D.J.  Rose,  "Partitioning,  Tearing  and  Modification  of  Sparse 
Linear  Systems,"  Journal  of  Mathematical  Analysis  and  Applications,  vol.  48, 
no.  2,  pp.  574-593,  1974. 

24.  R.E.  Tarjan,  "Depth-First  Search  and  Linear  Graph  Algorithms,"  SIAM  Jour- 
nal of  Computing,  vol.  1,  no.  2,  pp.  146-160,  June  1972. 

25.  H.B.  Hunt  EI  and  T.G.  Szymanski,  "Complexity  Metatheorem  for  Context-free 
Grammar  Problems,"  Journal  of  Computer  and  System  Sciences,  vol.  13,  no.  3, 
pp.  318-334,  1976. 

26.  B.B.  Kristensen  and  O.L.  Madsen,  "Methods  for  Computing  LALR(k)  Look- 
ahead,"  ACM  Trans,  on  Programming  Languages  and  Systems,  vol.  3,  no.  1, 
pp.  60-82,  Jan.  1981. 


APPENDIX 


/* 

*  This  is  a  function  to  see  if  one  expression  (el)  contains  another  (e2). 

V 

int  Contains  (  el,  e2,  first,  last  ) 
expr  el,  e2; 
int  first,  last; 

int  etl,  et2; 
int  i,  n; 

int  Found,  NullCount; 
expr  c,  h,  t; 

/* 

*  The  types  could  be  handy. 

V 

etl  =  Expr  Type  (  el  ); 
et2  =  ExprType  (  e2  ); 

if  (  (et2  =  CATNODE)  &&  (first  <  last) )  { 
et2  =  EMPTY; 
e2  =  EMPTY; 

} 

/* 

*  First  the  trivial  cases. 
*/ 

if  (  Equal(el  ,  e2)  &&  (first  =  0)  && 

(last  =  (ExprRank(el)-l))  )  return  1; 
if  (  Equal(e2  ,  EMPTY)  )  return  Member(Nullable,el); 


/* 

* 

V 


Switch  based  on  el's  type, 
switch  (  etl  )  { 


/* 

*       EMPTYSET,  eps  and  symbols  can  only  contain  themselves. 

V 

case  EMPTYSET: 
case  EMPTY: 
case  SYMBOLNODE: 
return  0; 
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For  a  STAR,  consider  the  alternatives  for  e2: 
e2  is  a  SYMBOL:  e2  must  be  in  SubExpr(el,0) 
e2  is  an  OR:    each  child  of  e2  must  be  in  el 
e2  is  a  STAR:    SubExpr(e2,0)  must  be  in  el 
e2  is  a  CAT:    A  Mess! 

First,  try  finding  all  of  e2  in  SubExprfel,Oj 

Then  try  splitting  e2  into  2  pieces  and  look 

for  the  first  peice  in  SubExpr(el,0)  and  the 

second  in  el.  Split  e2  in  all  positions  before 

giving  up. 

case  STARNODE: 

switch  (  et2  )  ( 

case  SYMBOLNODE: 

return  Contains  (  SubExpr(el,0),  e2,  first,  last ); 
case  ORNODE: 

n  =  ExprRank  (  e2  ); 
for  ( i  =  0;  i  <  n;  i-H- )  { 
c  =  SubExpr(e2,i); 

if  (  !Contains(el,c,0,ExprRank(c)-l)  )  return  0; 

return  1; 
case  STARNODE: 

c  =  SubExpr(e2,0); 

return  Contains  (  el,  c,  0,  ExprRank(c)-l  ); 
case  CATNODE: 

c  =  SubExpr  {  el,  0  ); 

if  {  Contains  (  c,  e2,  first,  last ) )  return  1; 

for  ( i  =  first;  i  <  last;  i-H- )  I 

if  (  Containsfel,e2,first,i)  && 

Contains(el,e2,i-|-l,last)  )  return  1; 

return  0; 

} 


For  an  OR,  consider  2  cases: 

e2  is  an  OR:  every  child  of  e2  must  be  in  el. 
otherwise:    e2  must  be  in  some  child  of  el. 

case  ORNODE: 

if  (  et2  =  ORNODE  ]  { 

n  =  ExprRank  (  e2  ); 
for  ( i  =  0;  i  <  n;  i-H- )  { 
c  =  SubExpr(e2,i); 

if  (  !Contains(el,c,0,ExprRank(c)-l)  )  return  0; 
return  1; 

}else{ 

n  =  ExprRank  (  el  ); 
for(i=0;i  <  n;i-H-){ 

if  (  Contains(SubExpr(el,i),e2,first,last)  ) 
return  1; 
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} 

return  0; 


/* 

*  For  a  CAT,  consider  the  cases  for  e2: 

*  e2  is  a  SYMBOL:  find  e2  in  1  child  of  el  and  the  rest  nullable. 

*  e2  is  a  STAR:    find  e2  in  1  child  of  el  and  the  rest  nullable. 

*  e2  is  an  OR:     each  child  of  e2  miast  be  contained  in  el. 

*  e2  is  a  CAT:     The  Worst  Mess! 

*  Use  function  TailContains  to  see  if  the  tail 

*  of  el  contains  the  tail  of  e2  with  the  tails 

*  starting  as  the  whole  expressions. 

*  I 

case  CATNODE: 

switch  (  et2  )  { 
case  SYMBOLNODE: 
case  STARNODE: 
Found  =  0; 
NullCount  =  0; 
n  =  ExprRank  (  el  ); 
for  ( i  =  0;  i  <  n;  i-H- )  { 

c  =  SubExpr  (  el,  i  ); 
if  (  Contains  (  c,  e2,  first,  last  )  )  { 
Found-H-; 

if  (  Member(Nullable,c)  )  NullCount 
}  else  { 

if  (  !Member(Nullable,c)  )  return  0; 

}  ^ 

return  (Found  &&  ((Found-NullCount)  <  2)  ); 
case  ORNODE: 

n  =  ExprRank  (  e2  ); 
for  ( i  =  0;  i  <  n;  i-H- )  { 
c  =  SubExpr(e2,i); 

if  (  !Contains(el,c,0,ExprRank(c)-l)  )  return  0; 

return  1; 
case  CATNODE: 

return  TailContains  (  el,  0,  ExprRank(el)-l, 
e2,  first,  last ); 

}  ^ 

retvirn  0; 


This  is  a  recursive  function  to  try  all  possible  patterns  for 
*       matching  the  tails  of  2  CAT  expressions. 

V 

int  TailContains  (  el,  il,  rl,  e2,  12,  r2  ) 
expr  el,  e2; 
int  il,  rl,  i2,  r2; 
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int  i,  et; 
expr  e; 


Look  for  empty  tail  of  e2.  When  e2  is  consumed  and  the  tail 
of  el  is  nullable,  the  original  e2  is  in  el. 

if(i2>r2){ 

for(i=il;i  <=rl;i++){ 

if  (  !Member(Nullable,SubExpr(el,i)) )  retiirn  0; 

^        return  1; 


Now,  look  for  empty  tail  of  el.  When  el  is  consumed  and  e2 
is  not,  then  e2  is  not  in  el. 

if  (  il  >  rl  )  return  0; 


OK.  Get  the  first  child  of  el's  tail.  If  this  is  not  a  STAR, 
then  it  must  be  a  SYMBOL  and  must  have  an  exact  match  in  the 
tail  of  e2.  If  it  is  a  STAR,  then  we  can  try  to  match  that 
STAR  with  any  prefix  of  e2's  tail. 

e  =  SubExpr  (  el,  il  ); 
et  =  ExprType  (  e  ); 

if  (  et  !=  STARNODE  )  { 

return  (  (Equal(e  ,  SubExpr(e2,i2)))  && 

TailContains(el,il+l,rl,e2,i2+l,r2)  ); 

}  else  { 

if  (  TailContains(el,iH-l,rl,e2,i2,r2)  )  return  1; 
for  ( i  =i2;  i  <=r2;  i-H-)  { 

if  (  Contains(e,e2,i2,i)  && 

TailContains(el,iH-l,rl,e2,i+l,r2)  )  return  1; 

return  0; 
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